# TruthLens - Data Collection

TruthLens is a project developed for the BSc. Computer Science (Data Science) Final Project (CM3070) at the University of London. TruthLens is based on the Fake News Detection template. 

## Project Objectives
The primary objective of this project is to build a two-stage pipeline for misinformation classification:

1. Binary classification (Stage 1): Distinguish between real news and misinformation using the ISOT dataset. This ensures robust detection at the first stage, leveraging an established dataset.
2. Multi-class classification (Stage 2): Further classify content identified as misinformation into one of four categories, based on an adaption of Molina et al.’s taxonomy. A custom dataset will support this nuanced classification.

The scope of the project is limited to text-based, English language content, explicitly excluding images and videos. A user interface will also be developed, enabling users to input articles or URLs and receive classification results.

A secondary objective is to enhance the explainability of classification results, aiming to provide users with interpretable insights into why content was classified in a particular way.

The project aims for high accuracy and reliability, with measurable performance goals. Ethical considerations, including bias mitigation and responsible dataset usage, will guide the design and implementation of the pipeline.

## Custom dataset generation
As outlined in the previous section, the second stage of the pipeline relies upon a custom dataset, labelled with categories from the Molina et al. Misinformation Taxonomy. These classes are summarised in the table below. The aim of this stage is to create a balanced dataset with 400 pieces of content for each of the 4 categories. The 4 categories chosen are: fabricated content, polarised content, satire, commentary.

| Misinformation Type | Characteristics | Example |
|:--------------|:---------------|:-------|
| Fabricated content | Completely false content created with the intent to deceive.| Fake reports of events that never occurred; entirely false claims about public figures |
|Polarised content |True events or facts presented selectively to promote a biased narrative, often omitting critical context. |Partisan news articles highlighting one side of a political argument while ignoring counterpoints.|
|Satire |Content intended to entertain or provoke thought through humour, exaggeration, or irony. Often misunderstood. |Satirical articles from outlets like “The Onion” being shared as if they are factual news.|
|*Misreporting* | *Incorrect information shared unintentionally, often due to errors or lack of verification.* | *A news outlet incorrectly reporting election results due to early or inaccurate data.*|
|Commentary |Opinion-based content reflecting the writer’s interpretation or viewpoint, often lacking factual grounding. |Editorials or blogs expressing subjective opinions without substantial evidence.|
|*Persuasive information* |*Content designed to persuade or influence the audience, often including marketing and propaganda.* |*Politically motivated propaganda campaigns, advertisements disguised as objective news articles.*|
|*Citizen journalism* | *User-generated content that may lack professional journalistic standards, leading to error or bias.* |*Social media posts about breaking news that spread unverified or incorrect details.*|

Data will be scrapped from relevant websites or sources for each category, then manually reviewed to ensure that it fits the category. Relevant features and labelling guidelines can be found for each category below.

In [30]:
#Imports and helper functions
import requests
import json
from bs4 import BeautifulSoup
import csv
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', 8000)
import re
import string
import nltk
import ftfy
from nltk.corpus import stopwords
from datetime import datetime
import unicodedata

def preprocess_text(text):
    """
        Preprocesses a given text string by applying the following steps:
        1. Converts the text to lowercase.
        2. Removes punctuation marks.
        3. Tokenizes the text into individual words.
        4. Removes stopwords (common words that add little value to classification tasks).

        Parameters:
        ----------
        text : str
            The input text string to preprocess.

        Returns:
        -------
        str
            The cleaned and preprocessed text, with tokens joined back into a single string.
    """
    stop_words = set(stopwords.words('english'))
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = [word for word in text.split() if word not in stop_words]
    return ' '.join(tokens)

def scrape_multiple_articles(urls, scrape_function):
    """
    Scrapes multiple articles from a list of URLs and stores the data in a DataFrame.

    Parameters:
    ----------
    urls : list
        A list of article URLs to scrape.
    scrape_function: string
        The name of the function we will use to scrape.

    Returns:
    -------
    pd.DataFrame
        A DataFrame containing the scraped daa from all URLs.
    """
    articles = []
    for url in urls:
        article = scrape_function(url)
        articles.append(article)
    return pd.DataFrame(articles)

def clean_text(text):
    """
    Normalize unicode characters, remove newlines, extra spaces,
    and truncate the text to a maximum length.
    """
    # Make sure input is a string
    if not isinstance(text, str):
        return text
    
    # Fix text encoding issues
    text = ftfy.fix_text(text)
    
    # Normalize to NFKC (to convert the weird Unicode math symbols)
    text = unicodedata.normalize("NFKC", text)
    
    # Remove mathematical alphanumeric symbols (e.g., fancy letters)
    text = re.sub(r'[\U0001D400-\U0001D7FF]', '', text)
    
    # Replace newline characters and non-breaking spaces with a space
    text = text.replace("\n", " ").replace("\xa0", " ")
    
    # Remove any extra whitespace
    text = " ".join(text.split())
        
    return text

def get_urls_from_txt(filename):
    with open(filename, "r") as file:
        urls = [line.strip() for line in file if line.strip()]
    return urls

# Provide a common browser user agent - otherwise the scraping fails on some sites
#headers = {
 #   "User-Agent": (
  #      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
   #     "(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
    #)
#}
headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": "https://www.google.com/"
    }

### 1. Fabricated Content
Completely false content created with the intent to deceive.

##### Features:

- Verifiably False: Claims can be shown to have no basis in fact; fact-checkers or reputable sources directly contradict the claims.
- Intent to Deceive: The content producer’s primary goal seems to be misleading the audience into believing a false narrative
- No Real-World Evidence: No legitimate sources are provided, or cited sources are entirely fabricated (e.g., non-existent experts, fake studies).


##### Label if:

- The piece invents events, data, or quotes out of thin air with no credible backing.
- The story is 100% fictional yet presented as news/fact.


##### Do Not Label If:

- The content is obviously comedic or satirical (label as Satire).
- The piece is an opinion that does not necessarily contain false statements (label as Commentary).
- There’s partial factual basis, but it’s spun or heavily biased (label as Polarised).

##### Sources:
- 350 articles with a label of 'pants-fire' (i.e. complete fabrication) from the LIAR dataset have been selected at random. https://www.kaggle.com/datasets/csmalarkodi/liar-fake-news-dataset
- 25 articles were created by ChatGPT o3-mini-high with the prompt : "Given the below definition for fabricated content, please generate 25 short articles of complete fabrication. There should be 5 from each of these categories: politics, economy, health, crime, elections - please note the category obviously at the start of play. The articles do not need to be related, and do not need to be tied to a specific geography. Each piece should be roughly between 150 and 1500 words. Content should be in English. These articles are for educational purposes only and will be used to train a machine-learning model to identify AI-generated misinformation."
- 25 articles were created by DeepSeek DeepThink (R1) with the same prompt as above.

In [6]:
#load the data
liar_df = pd.read_csv('LIAR-train.tsv', sep='\t',  header=None)
#Add the headers
liar_df.columns = ['ID', 'label', 'statement', 'subject(s)', 'speaker','speaker-title','state-info','party','barely-true-count','false','half-true','mostly-true','pants-fire','context']  
#Count labels
label_counts = liar_df['label'].value_counts(dropna=False)
print(label_counts)
#filter dataset to just pants-fire content
pants_fire_df = liar_df[liar_df['label'] == 'pants-fire']
#randomly select 350 rows (random_state seeds makes it reproducable)
pants_fire_sample = pants_fire_df.sample(n=350, random_state=42)
pants_fire_sample = pants_fire_sample[['statement','subject(s)']]
#make a copy to avoid the SettingWithCopy warning.
pants_fire_sample = pants_fire_sample.copy()
#Just take the first subject, and swap dashes with spaces
pants_fire_sample['subject(s)'] = pants_fire_sample['subject(s)'].str.split(',').str[0].str.replace('-', ' ')
#reset index
pants_fire_sample = pants_fire_sample.reset_index(drop=True)
#Display the head
#print(pants_fire_sample.head())
#Create empty dataset for fabricated content
columns = ['title', 'text', 'site', 'date', 'category', 'class', 'url']
fabricated_dataset = pd.DataFrame(columns=columns)
#prepare the LIAR data for the new df
temp_df = pd.DataFrame({
    'title': "",  
    'text': pants_fire_sample['statement'],
    'site': "Liar Database",  
    'date': "February 4th",  
    'category': pants_fire_sample['subject(s)'], 
    'class': "fabricated",
    'url': "https://www.kaggle.com/datasets/csmalarkodi/liar-fake-news-dataset"
})
fabricated_dataset = pd.concat([fabricated_dataset, temp_df], ignore_index=True)
print(fabricated_dataset.head())

half-true      2114
false          1995
mostly-true    1962
true           1676
barely-true    1654
pants-fire      839
Name: label, dtype: int64
  title  \
0         
1         
2         
3         
4         

                                                                                                                                        text  \
0                                                                      Ed Perlmutter voted for Viagra for rapists paid for with tax dollars.   
1                                                        More than 3,000 homicides were committed by illegal aliens over the past six years.   
2  (U.S. Reps.) Paul Ryan, Sean Duffy and Reid Ribble are shutting down town hall meetings, or making their constituents pay to attend them.   
3                                                   Earlier this year, [John DePetro] was accused of sexually assaulting a female co-worker.   
4                                                                  

In [7]:
#Chat GPT output
chatgpt_output = [
    ['Shadow Council Manipulates Global Policies','In a stunning revelation that has rocked the global political landscape, insiders have claimed that a secretive group known as the Shadow Council has been orchestrating international policy decisions behind the scenes for over two decades. According to anonymous sources within high-ranking government agencies, this clandestine network meets in undisclosed locations to determine the fate of nations—manipulating economic strategies, military deployments, and diplomatic relations with ruthless precision. One whistleblower, insisting on anonymity, described the council’s gatherings as “a blend of high-level intrigue and covert power plays,” where a handful of elite figures shape world events. Despite a complete lack of verifiable evidence and rebuttals from reputable fact-checkers, rumors persist, stirring suspicion among citizens and igniting fierce debates over the true nature of global governance. Critics demand full transparency, while supporters dismiss the claims as a political witch hunt.','politics'],
    ['The Rise of the Phantom Leader','Reports from undisclosed insiders claim that a mysterious figure—referred to only as the Phantom Leader—has quietly assumed control over several national governments simultaneously. Allegedly emerging from the shadows of political instability, this enigmatic leader is said to have orchestrated a series of covert meetings with influential policymakers in dimly lit back rooms. Documents leaked to a dubious online forum (purportedly authored by “deep-state informants”) suggest that the Phantom Leader’s network manipulates legislative agendas and even directs covert military operations without public knowledge. Despite lacking any credible sources, conspiracy theorists assert that this figure’s influence is so pervasive that major policy shifts and election outcomes across multiple continents can be traced back to secret communications from this single mastermind. Authorities have repeatedly denied any such existence, dismissing the reports as politically motivated fabrications. Nonetheless, the legend of the Phantom Leader continues to fuel debates on the hidden forces controlling modern politics.','politics'],
    ['Fabricated Faction’s Covert Conspiracy Exposed', 'A series of anonymous memos circulating on obscure internet forums have allegedly uncovered a covert conspiracy orchestrated by a fabricated political faction known as the “Crimson Syndicate.” According to these unverified documents, the Crimson Syndicate comprises influential lawmakers and shadowy advisors who purportedly manipulate policy decisions for their own benefit. The memos detail clandestine meetings held in remote, undisclosed locations where members allegedly decide on major legislative actions and orchestrate political scandals to discredit rivals. One particularly detailed memo claims that the Syndicate once arranged the downfall of an entire government cabinet simply to advance its own secret agenda. While no reputable news outlet or independent fact-checker has confirmed any part of this narrative, the circulating documents have nevertheless sparked heated discussions on social media and among fringe political groups. Detractors dismiss the allegations as wild fabrications, yet the growing fascination with the Crimson Syndicate continues to captivate those eager to believe in hidden, all-powerful networks in the realm of politics.','politics'],
    ["Hidden Alliances in the Corridors of Power", "In a narrative that sounds more like a spy thriller than reality, leaked “insider” communications now allege the existence of hidden alliances among top government officials across multiple nations. According to these fabricated sources, secret meetings held in luxurious, undisclosed locations have resulted in a series of backdoor pacts designed to bypass democratic processes. The documents—a mixture of blurry photographs, cryptic emails, and questionable “eyewitness” accounts—claim that leaders from different countries conspire to ensure their mutual benefit, often at the expense of public welfare. One source, identified only by the pseudonym “Nightwatcher,” asserts that these covert gatherings have influenced major global events, including trade wars and military escalations, with no oversight or accountability. Critics argue that the evidence is entirely manufactured, yet the tale of clandestine pacts behind closed doors continues to circulate widely, feeding the narrative that true power resides not in publicly elected officials but in secret alliances hidden in the corridors of power.",'politics'],
    ['Government Secrets Unveiled by Whistleblowers','A series of explosive revelations by alleged whistleblowers has ignited controversy in political circles, with claims that top government officials have been concealing vast amounts of classified information from the public. According to the fabricated reports, these officials have engaged in a deliberate cover-up involving the manipulation of policy outcomes, the redirection of public funds, and the orchestration of international incidents to distract from domestic mismanagement. Leaked documents—purportedly obtained through highly secretive channels—purport to show that covert committees operate independently of elected representatives, making decisions that affect millions without any form of public scrutiny. One anonymous source claimed that a secret “Transparency Committee” exists solely to fabricate narratives that support the government’s agenda. Although no hard evidence has emerged and fact-checkers have thoroughly debunked the claims, the idea of hidden governmental secrets continues to resonate with a segment of the population that remains deeply distrustful of official narratives.','politics'],
    ["The Secret Currency That Could Change the World","In a story that has captured the imaginations of economic conspiracy theorists everywhere, unverified sources have alleged the existence of a hidden global currency engineered by an elite cabal. Dubbed the “Phantom Coin,” this secret form of money is said to circulate only among the world’s most powerful financial institutions, outside the purview of national regulators and international oversight. According to the fabricated narrative, the Phantom Coin was created as a tool to destabilize traditional monetary systems and establish a new world order based on clandestine financial control. Anonymous insiders claim that this digital currency is already in circulation, used to facilitate secret transactions and influence economic policies in various countries. Although mainstream economists and banking authorities have dismissed these assertions as complete fabrications, the idea of a hidden monetary system has fueled heated debates on online forums and in underground economic circles. Critics argue that the concept of a global secret currency is nothing more than a cleverly constructed myth, designed to incite distrust in established financial institutions.","economy"],
    ["Hidden Financial Collapse Engineered by Elites", "A series of unsubstantiated leaks has sent shockwaves through online financial communities, with claims that a shadowy group of financial elites has orchestrated a deliberate plan to trigger a global economic collapse. According to the fabricated documents circulating on encrypted messaging apps, these elites have been manipulating stock markets, interest rates, and international trade agreements for decades. The conspiracy theory posits that by engineering an economic meltdown, this cabal intends to seize control of national economies and install a new financial system under their complete dominion. One anonymous source, signing off as “The Insider,” detailed how secret meetings held in undisclosed locations allegedly laid out a blueprint for the collapse, complete with timelines and specific economic indicators. Despite the lack of any credible evidence or confirmation from reputable institutions, the narrative has taken on a life of its own among conspiracy theorists. Mainstream experts have categorically rejected the theory, but the allure of a hidden hand guiding global economics continues to fascinate and alarm many.","economy"],
    ["The Phantom of Market Manipulation", "Recent reports from mysterious online channels claim that a covert group known as “The Phantom” has been secretly manipulating global stock markets to create artificial booms and busts. According to the entirely fabricated story, this group uses advanced algorithms and insider access to orchestrate dramatic swings in market values, ensuring that only a select few reap enormous profits while ordinary investors suffer severe losses. Leaked “evidence” in the form of blurry screenshots and unverified emails purport to show that major market indices were deliberately skewed during key financial events over the past decade. Conspiracy theorists argue that The Phantom’s actions are responsible for several notorious market crashes, though no reputable financial analyst or regulator has ever confirmed any such scheme. Instead, critics dismiss the allegations as modern folklore—a narrative designed to explain the often unpredictable nature of global finance. Nonetheless, the legend of The Phantom continues to spread across online communities, feeding the belief that the markets are secretly rigged by unseen forces.","economy"],
    ["Underground Trade Networks Revealed", "Whispers of an extensive underground trade network have recently surfaced in a series of online posts that claim to expose an elaborate system of secret deals and backdoor negotiations among multinational corporations and government insiders. According to these unverified accounts, this network—codenamed “Black Route”—is responsible for smuggling vital commodities, manipulating supply chains, and controlling prices on a global scale. Fabricated documents allegedly leaked from an anonymous source suggest that Black Route operates with near-impunity, using encrypted communication channels and hidden financial conduits to bypass international regulations. The posts detail intricate schemes involving fake invoices, shadow accounts, and secret meetings in remote warehouses. Despite the dramatic narrative, established trade experts and economic analysts have refuted the existence of any such network, attributing the claims to baseless rumors and intentional disinformation. Yet the allure of a hidden economic underworld continues to captivate the imaginations of those distrustful of global financial systems, even as authorities dismiss the reports as entirely fictitious.","economy"],
    ["Fake Economic Forecasts Uncovered by Investigative Reporters", "A recently circulated dossier—allegedly compiled by a group of rogue investigative reporters—claims that some of the world’s most prominent economic forecasts are nothing but elaborate fabrications designed to mislead the public and manipulate market sentiment. According to this entirely fabricated report, influential think tanks and financial institutions have conspired to publish optimistic projections despite mounting evidence of economic instability. The dossier asserts that behind the scenes, a secretive committee of experts is altering data and suppressing negative information to maintain investor confidence and secure lucrative financial deals. Interviews quoted in the dossier (all of which are untraceable) describe how internal memos instruct analysts to “spin the narrative” during times of economic downturn. While mainstream economists and reputable media outlets have thoroughly debunked these claims, the narrative has found traction on social media and alternative news platforms. Critics argue that the story is a carefully constructed piece of misinformation aimed at sowing distrust in established economic institutions and their published forecasts.","economy"],
    ["Miracle Cure or Conspiracy? The Hidden Truth", "A bombshell report circulating in underground online communities alleges that a revolutionary “miracle cure” for multiple chronic illnesses has been discovered in a secret laboratory—but that the cure is being deliberately suppressed by powerful pharmaceutical interests. According to the fabricated narrative, researchers at a clandestine facility in an undisclosed location developed a treatment that can reverse conditions ranging from diabetes to autoimmune disorders. Whistleblowers (whose identities remain unverified) claim that multinational drug companies, fearing a catastrophic loss of profits, have conspired to bury the research and discredit its findings. Detailed, though entirely fictional, documents describe covert meetings between executives and government regulators where plans were hatched to discredit the miracle cure through a series of “controlled clinical failures.” Despite the dramatic claims, no reputable medical journal or regulatory agency has ever confirmed the existence of such a treatment. Nonetheless, the story has ignited fervent discussion among alternative health advocates and conspiracy theorists, with many calling for independent investigations into the alleged cover-up.","health"],
    ["Government-Secret Vaccines and the Hidden Agenda", "In a narrative that has rapidly spread through social media channels, unverified sources now claim that several governments have developed secret vaccines—not to combat diseases, but to implant mind-control nanobots in unsuspecting citizens. According to the entirely fabricated account, these covert vaccines were engineered in hidden research facilities and are being distributed covertly alongside routine immunizations. Insiders allege that top government officials have conspired with shadowy biotech firms to implement the program as part of a larger scheme to control public behavior and suppress dissent. Detailed but unverifiable “leaks” include diagrams of nanobot technology and supposed internal memos outlining the project’s phases. Public health authorities and independent scientists have dismissed the claims as absurd and lacking any empirical basis, yet the narrative continues to fuel heated debates online. The story has become a rallying cry for those suspicious of government overreach, even as experts warn that the entire account is a complete fabrication designed to stoke fear and mistrust in established health institutions.","health"],
    ["The Fabricated Epidemic That Never Was","A recent series of posts on fringe health forums has claimed that an epidemic sweeping the globe is nothing more than a carefully orchestrated fabrication by international health agencies. According to these unfounded accounts, the so-called outbreak of a novel virus was deliberately invented to enforce draconian public health measures and expand governmental control over citizens’ lives. Fabricated “data” presented in the posts—including manipulated graphs and fake expert testimonies—purports to show that infection rates were grossly exaggerated and that the virus was engineered in a laboratory as part of a secret experiment. Despite overwhelming evidence to the contrary provided by reputable global health organizations, the narrative has gained traction among communities predisposed to distrust official sources. Detractors of the mainstream narrative argue that the epidemic is a hoax designed to justify unprecedented restrictions on personal freedom. While scientists and public health experts have thoroughly debunked the claims, the rumor of a fabricated epidemic persists as one of the most controversial and persistent conspiracy theories in the health sphere.", "health"],
    ["Shadow Health Organization Controlling Treatments", "A startling claim emerging from anonymous online sources alleges that a clandestine organization, known only as the “Global Health Directorate,” is secretly controlling all aspects of medical research and treatment protocols worldwide. According to this entirely fabricated narrative, the Directorate operates behind the scenes to determine which diseases receive funding for research and which innovative treatments are suppressed to protect the interests of certain pharmaceutical giants. Leaked “internal documents” (all of which have been debunked by experts) supposedly reveal that this shadow group manipulates clinical trial outcomes and deliberately withholds breakthrough therapies from the public. One supposed insider explained that the Directorate’s ultimate goal is to monopolize global healthcare, ensuring that all new treatments funnel profits exclusively to a handful of powerful corporations. While mainstream scientists and healthcare professionals have dismissed these claims as pure fantasy, the idea of a hidden health organization continues to resonate with individuals suspicious of modern medicine and its regulatory framework.","health"],
    ["The Pseudoscientific Breakthrough that Shocked Experts", "A recent online buzz has centered on reports of a pseudoscientific breakthrough—allegedly discovered by a renegade group of researchers—that claims to reverse aging and cure terminal illnesses in a single treatment. According to the fabricated account, the breakthrough involves a complex combination of gene therapy and nanotechnology, developed in a secret laboratory hidden beneath an abandoned industrial complex. The story goes on to assert that leading medical experts worldwide have been silenced or coerced into keeping the discovery under wraps, with influential institutions allegedly colluding to protect lucrative existing treatments. Detailed but entirely spurious “research notes” and blurry laboratory images have been circulated to support the claim. Despite the sensational nature of the announcement, no peer-reviewed studies or independent verifications exist to corroborate the story. Health authorities and renowned scientists have categorically refuted the claims, calling the report a dangerous piece of misinformation designed to exploit public hopes for miraculous cures.","health"],
    ["The Mastermind Behind the Global Heist","In an astonishing tale that sounds straight out of a blockbuster movie, unverified sources have alleged the existence of a criminal mastermind orchestrating a series of sophisticated heists across multiple continents. Dubbed “The Phantom Thief” by underground circles, this enigmatic figure is said to have masterminded daring robberies targeting high-security financial institutions and luxury art galleries alike. According to the fabricated narrative, The Phantom Thief utilizes an intricate network of accomplices and cutting-edge technology to bypass state-of-the-art security systems. Leaked “confidential reports” (entirely unverifiable) claim that the mastermind’s operations are so meticulously planned that law enforcement agencies remain one step behind at every turn. One anonymous tipster described a dramatic scene in which the criminal escaped using an elaborate series of decoys and underground tunnels. Despite widespread media interest and online chatter, no credible evidence supports these claims, and authorities have repeatedly dismissed the story as an elaborate fabrication. Nonetheless, the legend of The Phantom Thief continues to capture the imagination of both criminals and crime enthusiasts.", "crime" ],
    ["The Cyber Syndicate and the Digital Black Market", "A series of posts on dark web forums has recently brought attention to an alleged cyber syndicate that is said to run an expansive digital black market, controlling vast networks of hackers and cybercriminals. According to the entirely fabricated narrative, this syndicate—known only as “Digital Dominion”—is responsible for orchestrating large-scale data breaches, identity thefts, and even orchestrated cyberattacks on critical infrastructure. The story details how Digital Dominion supposedly recruits skilled hackers from around the globe, providing them with state-of-the-art tools and secretive training in return for a share of their illicit profits. Leaked “evidence” in the form of anonymized chat logs and cryptic online transactions has fueled speculation about the syndicate’s influence over modern cybercrime. Despite the dramatic claims, no law enforcement agency has confirmed the existence of such an organization, and cybersecurity experts have dismissed the narrative as a myth designed to instill fear. Nevertheless, the notion of a centralized cybercriminal empire continues to spread rapidly among online communities, adding fuel to debates about digital security.","crime"],
    ["Fake Evidence Links Celebrity to Crime Ring", " scandalous claim has emerged from questionable online sources alleging that a world-renowned celebrity is secretly involved in an international crime ring. According to the fabricated report, the star—whose identity remains deliberately vague—has been linked through a series of doctored documents, manipulated photographs, and untraceable phone recordings to an underground network involved in money laundering and arms trafficking. The narrative suggests that the celebrity’s public persona is merely a facade, carefully crafted to conceal a far more sinister involvement in organized crime. Despite the sensational nature of the claim, independent investigations by reputable outlets have found no supporting evidence, and multiple fact-checking organizations have debunked the story as a fabrication. Nonetheless, the tale has ignited fervent debate on social media, with supporters insisting that the “evidence” is being suppressed by powerful interests intent on protecting high-profile figures. Critics argue that the entire narrative is a calculated piece of misinformation designed to smear reputations and distract from real criminal investigations.","crime"],
    ["The Underworld’s Hidden Code of Silence", "Whispers from the criminal underworld have given rise to a fabricated narrative detailing an alleged secret code of silence that binds organized crime groups across continents. According to the entirely unverified account, this so-called “Code of Shadows” mandates that members of illicit organizations adhere to strict rules of non-disclosure about their operations, with severe—and entirely invented—consequences for any breaches. Leaked “testimonies” from anonymous ex-criminals (whose identities cannot be confirmed) claim that this code is enforced through a network of vigilante enforcers operating outside the law. The report further asserts that this clandestine system has allowed crime syndicates to thrive, coordinating complex operations such as international drug trafficking, cybercrimes, and high-stakes robberies without fear of exposure. While law enforcement officials have long acknowledged the existence of informal codes among criminals, no evidence has ever substantiated the detailed version of the Code of Shadows described in these posts. Nevertheless, the story has captured the public’s imagination, fueling both fear and fascination with the hidden rules of the underworld.","crime"],
    ["Alleged Supernatural Connection in Organized Crime", "In a bizarre twist that has stirred both intrigue and skepticism, unverified online sources claim that an otherworldly element is at work within organized crime circles. According to this fabricated narrative, certain notorious crime families are rumored to have forged secret pacts with mysterious, supernatural entities in exchange for uncanny success in their illicit endeavors. The story describes eerie rituals performed in abandoned warehouses under moonlit skies, where members of these crime families allegedly invoke ancient forces to secure their power and evade capture by authorities. Detailed but entirely fictional accounts include descriptions of cryptic symbols, mysterious chants, and inexplicable phenomena witnessed during criminal operations. While no credible evidence or expert testimony supports any supernatural involvement in crime, the tale has rapidly spread through niche internet forums and alternative news sites. Skeptics dismiss the narrative as pure fantasy, yet its persistence highlights the human tendency to weave extraordinary explanations around the most enigmatic and frightening aspects of criminal life.","crime"],
    ["AI-Driven Vote Rigging Uncovered", "A startling claim emerging from shadowy online sources alleges that recent elections in multiple countries were manipulated using advanced artificial intelligence systems designed specifically for vote rigging. According to the entirely fabricated report, an underground network of tech experts and political operatives developed a sophisticated AI program that could alter digital ballots and even sway public opinion through targeted disinformation campaigns. Leaked “internal communications” (all of which lack any credible origin) detail how this system was deployed during key electoral cycles to produce results favorable to a select group of political elites. The report asserts that the AI not only manipulated vote counts but also fabricated evidence of voter fraud to justify its interference. While election officials and independent watchdog organizations have vehemently denied any involvement of AI in vote manipulation, the narrative has ignited fierce debates online. Critics dismiss the allegations as modern myth-making, yet the idea of a clandestine, algorithm-driven election interference continues to find an audience among those distrusting traditional democratic processes.","elections"],
    ["Hidden Ballots and Phantom Voters", "In a narrative that has rapidly spread through fringe political blogs, unverified sources now claim that a secretive scheme involving hidden ballots and phantom voters was implemented during recent national elections. According to the fabricated account, shadow operatives allegedly inserted fake ballots into the voting system, and entirely fictitious voter identities were created to sway the outcome in key districts. Detailed but entirely false “evidence”—including manipulated voter records and doctored official documents—purports to show that thousands of non-existent citizens were added to the rolls, tipping the scales in favor of a prearranged result. The story asserts that these phantom voters were registered using advanced data manipulation techniques, and that the entire operation was coordinated from undisclosed headquarters by a covert group of political insiders. While election authorities have consistently maintained that voter registration and ballot counting were conducted transparently and accurately, the rumor of hidden ballots and ghost voters continues to spark controversy. Skeptics warn that such narratives are dangerous fabrications intended to undermine public confidence in democratic institutions.","elections"],
    ["The Secret Software Behind Election Fraud", "A fabricated exposé circulating on alternative news platforms alleges that the integrity of recent elections was compromised by secret software embedded in voting machines. According to the entirely unverified report, a rogue group of software engineers collaborated with political operatives to install a hidden program capable of altering vote totals in real time. Detailed descriptions in the report claim that the software was designed to target specific precincts and switch votes from opposition candidates to those favored by the conspirators. Anonymous “insiders” (whose identities remain unverifiable) provided screenshots and technical schematics to support the claim, though none have been authenticated by independent experts. Election officials have categorically denied any tampering with voting equipment, yet the narrative persists among groups that already harbor deep suspicions of electoral fraud. While mainstream media and cybersecurity professionals dismiss the allegations as a digital-age urban legend, the story has fueled ongoing debates about the security and transparency of modern voting systems.","elections"],
    ["International Conspiracy Alters Poll Results", "A sensational claim has emerged from obscure online communities alleging that an international conspiracy was behind the manipulation of poll results in recent elections. According to this fabricated narrative, a coalition of foreign intelligence agencies and political operatives conspired to alter vote tallies through covert operations, including hacking voting systems and deploying disinformation campaigns across borders. The report—supported by entirely unsubstantiated “leaked” documents and cryptic video footage—purports to show that the conspiracy was orchestrated from hidden command centers located in various parts of the world. Proponents of the story argue that the altered results were part of a larger plan to undermine national sovereignty and install puppet governments. Despite repeated denials from official election commissions and independent international observers, the narrative continues to gain traction among segments of the public already inclined to distrust electoral processes. Experts, however, maintain that there is no credible evidence of any such international interference, calling the story a complete fabrication designed to stoke geopolitical paranoia.","elections"],
    ["The Unseen Hand Steering Democracy", "In a final explosive installment of fabricated election conspiracies, unverified online sources claim that an unseen hand has been subtly steering democratic outcomes for decades. According to the entirely fictitious report, a secret cabal of influential figures—including undisclosed political advisors, wealthy oligarchs, and covert intelligence operatives—has been manipulating voter sentiment and election results from behind the scenes. Detailed accounts in the report describe how this cabal allegedly funds political campaigns, engineers media narratives, and even tampers with ballot-counting machines to ensure desired outcomes. The narrative is supported by a series of dubious “eyewitness” testimonies and manipulated documents that purport to reveal a long-standing pattern of covert intervention in democratic processes. While election experts and historians have long refuted such sweeping claims, the story of an unseen hand controlling the destiny of nations continues to resonate with those disillusioned by modern politics. Critics argue that the tale is a carefully constructed piece of misinformation intended to erode public trust in the very foundations of democracy.", "elections" ]
]
#DeepSeek output
deepseek_output = [
    ["World Leader Secretly Funds Alien Technology Research, Leaked Docs Claim", "A classified dossier allegedly reveals that the leader of a major European nation diverted €800 million in public defense funds to a clandestine extraterrestrial tech program. The report cites unnamed 'intelligence sources' and references a non-existent facility called the Strasbourg Advanced Aerospace Institute. Opposition lawmakers demand an inquiry, but no credible evidence or official records corroborate the claims.","politics"],
    ["Pacific Island Nation Declares War on Canada Over Fishing Rights","Fabricated diplomatic cables suggest the tiny nation of Maritana threatened military action against Canada after accusing it of illegal deep-sea trawling. The story cites a fake Global Oceanic Rights Council report and a fictional Maritanian official, 'Minister Koa Tala.' No such dispute exists, and Maritana is not a recognized country.","politics" ],
    ["UN Secretary-General Arrested for Espionage, Anonymous Sources Allege","An unsigned blog post claims UN Secretary-General António Guterres was detained in a joint CIA-Russian operation for “selling state secrets.” The article quotes a non-existent Interpol warrant and a phantom “Geneva Security Summit” attendee. The UN has debunked the story as baseless.", "politics" ],
    ["Secret Pact Reveals Plans to Merge US and Mexico into ‘North American Union’", "A fringe website alleges that President Biden and Mexican President López Obrador signed a treaty to dissolve borders by 2028, backed by a forged document bearing fake seals. The hoax cites the Institute for Continental Integration, a think-tank that does not exist.", "politics"],
    ["Australia’s PM Found to Have Dual Citizenship of Nonexistent Country", "A viral post asserts Australian Prime Minister Anthony Albanese holds citizenship in Veridia, a fictional island nation. The claim relies on a Photoshopped passport and a fabricated International Citizenship Database. Australia’s government confirmed no such country is recognized.", "politics"],
    ["Gold to Be Outlawed as Global Currency Shift Begins", "A conspiracy outlet warns that the World Financial Authority (WFA) will ban private gold ownership in 2024 to pave the way for a digital currency. The WFA is fictitious, and no such policy proposals exist from real entities like the IMF or World Bank.","economy"],
    ["China’s Economy Collapses After ‘Black Monday’ Stock Market Crash", "A fake news site reports a 40% plunge in Shanghai stocks, attributing it to a nonexistent “debt contagion.” The article quotes “economist Dr. Li Wen” and the Asian Fiscal Stability Board, both fabricated. Actual Chinese markets showed no unusual activity.","economy"],
    ["New Global Tax Will Charge 5% on All Online Purchases, UN Announces", "A fraudulent press release claims the UN approved a universal e-commerce tax to fund climate initiatives. The document references a non-existent resolution (UN-2023/TCX) and a fake UN department. The UN confirmed no such tax exists.","economy"],
    ["Bitcoin Banned Worldwide After Secret G7 Summit","A clickbait article alleges G7 leaders agreed to criminalize cryptocurrency transactions under a clandestine “Operation Blockchain Shield.” The story cites anonymous “G7 insiders” and a phantom regulatory body, the Global Digital Asset Bureau.", "economy" ],
    ["Major Bank Announces Negative Interest Rates for Savings Accounts", "A spoofed JPMorgan Chase memo circulating online claims the bank will charge customers 2% annually to hold savings. The fake notice includes a forged signature from CEO Jamie Dimon. JPMorgan denied the policy, calling it “pure fiction.”","economy"],
    ["Vaccine Causes Infertility in 70% of Recipients, Fake Study Claims", "A debunked paper from the fabricated European Medical Review falsely links COVID-19 vaccines to infertility. The study, authored by “Dr. Erik Voss” of the nonexistent Berlin Institute of Virology, cites anonymous patient surveys. No peer-reviewed research supports this.","health"],
    ["Deadly ‘Zombie Virus’ Spreads in South America, WHO Warns", "A hoax article describes a fictional outbreak of Cortazar Virus, causing “aggressive behavior and organ failure.” It quotes a fake WHO spokesperson, “Dr. Amara Singh,” and a non-existent health alert. The WHO confirmed no such virus exists.", "health"],
    ["Common Food Additive Linked to Brain Damage, Researchers Find", "A pseudoscientific blog claims titanium dioxide (E171) causes dementia, citing a fake Global Food Safety Alliance study. The article invents a “Dr. Lisa Tanaka” and misrepresents actual E171 research, which finds no such link.","health"],
    ["Cancer Cure Discovered in Mushroom Species, But Big Pharma Suppresses It", "A conspiracy theory alleges the Amazonian Luminescent Shroom eliminates tumors but is withheld by drug companies. The story references a nonexistent Journal of Oncology Advances paper and a fictional researcher, “Dr. Carlos Mendez.”","health"],
    ["Airborne HIV Variant Detected in Europe, Health Officials Panic", "A fabricated alert from the European Center for Disease Prevention warns of a mutated HIV strain spreading via coughs. The report cites fake case numbers in Spain and France. Actual HIV cannot transmit through airborne particles.","health"],
    ["AI-Powered Robots Commit $1 Billion Bank Heist in Singapore","A tabloid claims hackers deployed autonomous robots to loot the United Pacific Bank. The story quotes a nonexistent CyberCrime Task Force investigator, “Agent Maya Lee,” and provides no police reports or bank confirmations.","crime"],
    ["Serial Killer Targets Only Left-Handed Victims, Police Say","A false crime bulletin describes a fictional murderer dubbed “The Southpaw Slayer” operating in Argentina. The article cites a phantom Buenos Aires police captain, “Inspector Raul Gomez,” and fabricated victim profiles. No such cases exist.","crime"],
    ["Prison Break in Norway: 200 Inmates Escape Using Underground Tunnels","A sensationalized piece alleges inmates at Oslo’s Fjord Maximum Security Prison dug a mile-long tunnel. The story references a fake warden, “Henrik Dahl,” and includes AI-generated images of the escape. Norwegian authorities confirmed all prisons are secure.","crime"],
    ["Mafia Develops Invisible Drug Smuggling Drones, Interpol Warns","A conspiracy site reports organized crime groups using “cloaked drones” to traffic narcotics. The article cites an unnamed Interpol official and a nonexistent tech firm, StealthCargo Inc. Interpol denied issuing any such alert.","crime"],
    ["Celebrity Chef Kidnapped by Vegan Extremist Group","A fake news outlet claims Gordon Ramsay was abducted by the Vegan Justice Army demanding he stop serving meat. The hoax includes a forged ransom note and a fabricated spokesperson, “Ava Green.” Ramsay’s team confirmed his safety.","crime"],
    ["Voter Fraud Uncovered: 1 Million Fake Ballots Found in Warehouse","A far-right blog alleges a warehouse in Texas stored counterfeit ballots for the 2024 election. The story cites an anonymous “election integrity group” and a fake address. State officials confirmed no ballots were found.","elections"],
    ["Candidate Drops Out After Secret Love Child Scandal","A smear article accuses a fictional Canadian MP, “Sarah Clarke,” of concealing a child with a staffer. The piece uses a doctored photo and quotes a nonexistent tabloid, Ottawa Exposé. Clarke is not a real politician.","elections"],
    ["Foreign Agents Infiltrate Voting Systems in 12 States, FBI Claims","A disinformation campaign alleges Russian hackers compromised U.S. voting machines. The article references a fake FBI memo and a phantom cybersecurity firm, ShieldWall Analytics. The FBI stated no breaches occurred.","elections"],
    ["AI-Generated Candidate Wins Local Election in New Zealand", "A satirical claim repurposed as news states an AI persona named “Polly” won a mayoral race in Christchurch. The story cites a fake election commission report and a non-existent AI company, VoteBot Inc. No such election took place.","elections"],
    ["Election Postponed Indefinitely Due to ‘National Security Threat’","A fabricated emergency decree alleges India delayed its 2024 elections over a bogus “terror plot.” The article quotes a fictional home ministry official, “Rajeev Kapoor,” and provides no credible sources. Indian officials denied the claim.","elections"]
]
#add static values to track where each came from
chatgpt_output = [item + ["ChatGPT","chatgpt.com"] for item in chatgpt_output]
deepseek_output = [item + ["DeepSeek","chat.deepseek.com"] for item in deepseek_output]

#combine the outputs from the different LLMs
llm_output = chatgpt_output + deepseek_output

#create a DataFrame from the list
llm_df = pd.DataFrame(llm_output, columns=['title', 'text', 'category','site','url'])

#add the constants
llm_df['date'] = "February 4th"
llm_df['class'] = "fabricated"

#reorder the columns
llm_df = llm_df[['title', 'text', 'site', 'date', 'category', 'class', 'url']]

#concatenate all fabricated data
fabricated_dataset = pd.concat([fabricated_dataset, llm_df], ignore_index=True)
print (len(fabricated_dataset))

#store to CSV
fabricated_dataset.to_csv("fabricated_articles.csv", index=False)

400


### 2. Polarised content
Polarised content is true events or facts selectively presented to promote a biased narrative, often omitting critical context.

##### Features:
- Partial Truth: The piece is based on a real event, statistic, or quote.
- Omission / Distortion: The content emphasizes certain facts while ignoring or minimizing others, creating a skewed impression.
- Strong Bias: The language or framing clearly supports one political, ideological, or partisan stance, rather than offering balanced coverage.

##### Label if:
- The article references real events but uses them to push a strong, one-sided narrative.
- The content focuses on data or testimonies that bolster a specific stance while disregarding contradictory evidence.
- The tone or style is heavily partisan and attempts to sway opinion by selective fact usage rather than outright fabrication.

##### Do Not Label if:
- The core facts are outright false (label as Fabricated).
- It is primarily personal opinion or commentary without strong factual references (label as Commentary).
- It is purely an attempt at persuasion or advertising without misrepresenting an event (label as Persuasive).

##### Sources:
- The Conservative Woman (UK, Right leaning) https://www.conservativewoman.co.uk/ (100)
- The Canary (UK, Left leaning) https://www.thecanary.co/uk/ (100)
- Breitbart (USA, Right leaning) https://www.breitbart.com/ (100)
- Daily Kos (USA, Left leaning) https://www.dailykos.com/ (100)

**The Conservative Woman**

Articles were scraped from the weekly "Our Top Ten Articles of the Week" series, starting from the January 11, 2025 edition (https://www.conservativewoman.co.uk/tcw-our-top-ten-articles-of-the-week-9/), ending on the February 22 edition.

A large number of articles were skipped. "Features" and "Family and Faith" articles were skipped as they are not news. Many of the other articles did not meet the criteria for labelling, instead falling under Commentary, for example: https://www.conservativewoman.co.uk/wind-turbines-and-a-voice-in-the-wilderness/ These were primarily recognised by a focus on "I" and "me" in the text.

In [31]:
def scrape_tcw_article(url):
    """
    Scrapes an article from a given URL on conservativewoman.co.uk and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Polarised", 
        "url": url
    }
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": "https://www.google.com/"
    }

    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        
        soup = BeautifulSoup(response.content, 'html.parser')

        # Title
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        # Remove the trailing site name
        if article_data["title"].endswith(" - The Conservative Woman"):
            article_data["title"] = article_data["title"].replace(" - The Conservative Woman", "")
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Published date
        published_date_meta = soup.find('meta', property='article:published_time')
        article_data["date"] = published_date_meta['content'] if published_date_meta else "Published date not found"
        
        # Category
        yoast_script = soup.find("script", class_="yoast-schema-graph", type="application/ld+json")
        if yoast_script:
            try:
                data_json = json.loads(yoast_script.string)
                for node in data_json.get("@graph", []):
                    if node.get("@type") == "Article":
                        art_sec = node.get("articleSection", None)
                        if art_sec:
                            if isinstance(art_sec, list):
                                article_data["category"] = art_sec[0]
                            else:
                                article_data["category"] = art_sec
                        break
            except json.JSONDecodeError:
                print("Could not parse the JSON-LD correctly.")
        
        # Article copy
        content_div = soup.find("div", class_=lambda c: c and "td-post-content" in c)
        if content_div:
            # Collect paragraphs
            paragraphs = content_div.find_all("p")
            text_list = []
            for p in paragraphs:
                text = p.get_text(strip=True)
                # End before the donation paragraph
                if text.startswith("If you appreciated this article, perhaps you might consider making a donation"):
                    break  
                text_list.append(text)
            #join all paragraphs together
            full_text = " ".join(text_list).strip()
            # Remove web addresses using a regex
            full_text = re.sub(r'https?://\S+', '', full_text)    
            article_data["text"] = clean_text(full_text)
        else:
            article_data["text"] = "Article text not found"
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")

    return article_data

#load list of URLs
urls = get_urls_from_txt("conservativewoman.txt")
# Scrape articles and create a DataFrame
tcw_data_df = scrape_multiple_articles(urls, scrape_tcw_article)
# Store to CSV
tcw_data_df.to_csv("polarised_scraped_articles_tcw.csv", index=False)
# Print head 
tcw_data_df.head()

Unnamed: 0,title,text,site,date,category,class,url
0,Conferences of the smug are no match for Satan’s foot soldiers,"GOOD people who achieve squat in echo chambers with canapes – this was my initial reaction to the advent of two conferences focused on the crap state of the world. The first is thecurrent conference of the Alliance for Responsible Citizenship, otherwise known as ARC, foundedin June 2023. They meet in Londonistan. The second is the forthcoming Church and State conferencecoming soon to Brisbane. When funky acronyms, slick websites and bells and whistles arrive, you think: Hmm. Not to mention the comfortable on-stage couches and the being-among-friends vibe. And no room for 'deplorables' at £1,500 a throw, I'm told was the price of a ticket – with a £400 discount offered to people ARC wanted there, the rumour mill has it. The comfort of being right (in both senses of the term) without actually moving the needle. The contrast can be found in JD Vance, speaking truth to power in the den of his (and our) enemies in Munich. His harsh words about his audience of European globalist establishment grifters have gone around the world several times. They suggest bravura, intent, action, steely composure, truth and . . . Donald Trump, the figure that still makes many staid comfort-conservatives clench the buttocks. There is no greater exemplar of the performative conservative class than Greg Sheridan ofThe Australian. Peter Smith atThe New Catallaxynotes: '. . . he who no longer refers to Donald Trump as a despicable human being but who, so far as I can tell, has never apologised for so doing'. Indeed. Non-apologies count. Not in a good way. And are noted. These people are pathetic, really. They are content to be cerebral chin-strokers, while hoisting the white flag. Over and over again. Rod Dreher, a speaker at the ARC conference, summed up one of the key purposes of these events: 'Good morning from London, and the first day of the ARC conference. Big opening dinner for speakers and donors last night; as a speaker, I got to go. Was so, so great to see many old friends there . . . Just now, after breakfast, I walked out of the hotel restaurant, and there sat two of my favorite people in all the world . . . Oh happy day! Oh happy next three days!' Source: Rod Dreher emailed newsletter, February 19, 2025. Fellowship. I get it. All fine and dandy. We all need our mutual support mechanisms. But the line between 'fellowship' (Kemi Badenoch and Nigel Farage, really?) and a dry gulch, an echo chamber, a right-of-centre ghetto, is a fine line indeed. Wonderful speakers at these endless conferences – some of whom I know and like – use beautiful words to make absolutely no impact on their enemies and ours. Douglas Murray left the audience of 4,000 'speechless'. I am sure he did. He is very, very good at this. You might agree with just about everything he said. That isn't the point. To repeat, our enemies are domestic Marxist revolutionaries, post-modernists, woke automatons, Chinese communist imperialists, Muslim colonialists, globalists seated in Europe, corporate fascists, American neocons and British liberal progressives parading in conservative clothes. I am yet to hear a single international conferencer pull all of this together. To address the real meta-problems. I say wonderful speakers. Not all are wonderful, though. Any conference that gives Michael Gove a slot and an invitation cannot be taken seriously. He was a two-timing sneak of a politician and, far worse, a covid criminal. The multi-millionaire who appointed Gove as editor of theSpectatorhappens to be the co-founder and co-owner of ARC, Paul Marshall, who also ownsGB News.The mob who sacked Mark Steyn and Dan Wootton. Neither of these heroes has a place, nor has Kathy Gyngell, myTCWeditor. Bilderberger Niall Ferguson, inevitably. Anyone who is a friend (and biographer) of Henry Kissinger isn't a friend of mine. Beautiful speaker, though. The inevitable Jordan Peterson is there. Both are on ARC'sexclusive Advisory Board. RFK Jr recently said: 'NOTHING is gonna be off-limits.' Source: Ed Dowd's Telegram channel, February 19, 2025. He was referring specifically to chronic disease investigation, but his words refer to much, much more than this. They are a call to action. They resonate. RFK Jr is feared by all the bad guys. Utterly feared. As is Kash Patel. As is Tulsi Gabbard. These people are impactful. Consequential. Far more importantly, they are revolutionary. If our circumstances are as dire as the ARC team suggest, why aren't THESE people there? Oh, of course, they are too busy across the Atlantic, actually doing things. A report intheTelegraphsaw the group as a corrective to the 'chaos' and 'excesses' in American conservatism during the rise ofDonald Trump. Could that still be their real purpose? The ARC has been billed as a right-wing competitor of Davos, no doubt because of its 'global leaders space' recruitment strategy. But those who see here some sort of conference equivalence entirely miss the point. Davos ain't a talkfest. The people who attend the Swiss Alps shindig are bad actors with power who do actually run the world. They are Satan's foot soldiers. Pitting an opposing talkfest against them takes us nowhere. This is a category error. I worked for a quarter of a century in economic development, generally at the local and regional level. I attended many conferences. Endless conferences. You felt good being there. Among friends. Being part of the smart set. But the one great insult across the years was that these endless meet-ups of the great and good, with their well remunerated keynotes, was 'talkfest'. That was always the greatest fear of delegates at these things and of observers. What was it all for? It was, in essence, performative. Process. Part of the deal. It hasn't changed. The sins of the performative conservative conference class are sins of omission, not of commission. I have mentioned the speakers not at the London conference. What about the topics not addressed? Covid: Will any of the speakers there expose the truths daily reported in the alt-media? Will Michael Gove apologise for his covid sins? Will Paul Marshall explain or seek to justify his sacking of Mark Steyn and friends? Nope. Will Niall Ferguson admit that the elites of whom he is a paid-up member admit that his ilk is the problem? The whole problem. The key word is 'smug'. No one suffering under the yoke of tyranny likes smug. They like anger-driven action that will free the oppressed. The grieved outsiders currently feel far more comforted by the 'nothing is gonna be off limits' view of the world articulated so well by RFK Jr and by the sheer determination of those who say it than by champagne swilling, globetrotting conservatives feeling good about life in London. Or in Brisbane. It is time to put a bit of stick about, not just to glory in being right (in both senses of the term). It is an irony that one of the most discussed topics at the London conference has been JD Vance's speech in Munich. Will any of them get the deep point here?",The Conservative Woman,2025-02-21T01:20:00+00:00,News,Polarised,https://www.conservativewoman.co.uk/conferences-of-the-smug-are-no-match-for-satans-foot-soldiers/
1,Throwing the SAS under a bus,"THE relationship between the public, justice, the electorate and the law is getting more complex. Potentially it's been a worse one for the SAS, the Armed Forces and the Defence of the Realm. In an inquest that opened last year, acoroner has ruled that SAS soldiers did not need to use lethal forcein an operation in which four IRA terrorists were killed in 1992. A file is therefore being passed to the Director of Public Prosecutions. Unlesshe does the decent thing and resigns, Lord Hermer, Sir Keir Starmer's human rights lawyer pal, is unlikely to decide either that there is no case to answer (killing armed terrorists being the Army's job) or that it's so long ago that pursuing this is not in the public interest. Harmer might recuse himselfas one of his clients was former head of the IRA Gerry Adams. Don't hold your breath. Let's look at the facts, from the ground up. The rules of engagement for soldiers in Northern Ireland came from Parliament and were known as 'The Yellow Card' after the card carried and learned by rote by every soldier in the province. Soldiers in Northern Ireland had no more rights than a private citizen, so the provisions of the Yellow Card were pretty clear. Essentially soldiers were allowed to open fire on terrorists who were about to commit, were committing or had recently committed a terrorist act and there was no other way to stop them. Warnings were mandatory 'before opening fire unless . . . to do so would increase the risk of death or grave injury to you or any other person'. The Yellow Card never had statutory authority; soldiers who killed terrorists were subject to arrest and interrogation by local police to determine whether they had committed a crime. Read that again: any soldier who had just shot and killed a terrorist to protect you was almost immediately arrested on suspicion of manslaughter, or worse. Why? Because although the government sent him to uphold the law with a loaded rifle it didn't give him any special rights to use it. At one time during the Troubles soldiers were so concerned at the immediate consequences of shooting that they were failing to open fire at legitimate targets. The fundamental question for the soldier was 'Do I shoot now?' Confronting an armed terrorist, the soldier was making this decision under mortal peril and with massive time pressure. Arresting an armed terrorist – also in fear of his life – is far from straightforward. As there is no requirement in common or military law for soldiers recklessly to risk their life, taking the shot is usually the right answer. If they don't shoot the terrorist survives to kill, bomb and maim the public again. In this action four IRA men were returning to waiting getaway cars after an attack on the Coalisland police station; a total of ten IRA men were there. An SAS unit was waiting for them and they engaged the IRA men without warning. The four who attacked the police station were killed and three of the others were wounded. There is no doubt that the dead men were armed IRA terrorists. There is evidence that some of the dead were killed at close range when they were on the ground, having already been hit by multiple bullets. So what? The decision to kill them had been made when the first round was fired. Armed men are a lethal threat and, unlike in the movies, a hail of bullets seldom delivers instant death. A dying terrorist may have a grenade in his pocket, so he's still a threat to the soldier and his colleagues. A firefight and its aftermath are no place for the squeamish. The salient point is whether the 12 SAS soldiers opening fire believed they could find any way of arresting ten IRA men, at least four of whom were armed, at night in hostile territory. Clearly they thought not, hence they engaged. On the evidence before him the coroner thought differently, as is his right – indeed his duty.Last year he found that a similar action in 1991 which killed three IRA men was lawful. His independence is not in question, it's the law itself that is wrong. Now, 33 years later, the SAS men potentially face a criminal trial. Notwithstanding the coroner's opinion, that result seems wrong – 96 per cent of those polled by theTelegraphopposed it. At the time there was little sympathy for the IRA men outside the Irish Republican and Nationalist movements. The public were hugely supportive of the SAS and the armed forces then, as they are now. The law, as created by a selection of politicians of all parties over almost half a century, has failed. It's not just the elite SAS that are being failed, it's every serviceman and woman, and indeed the society they seek to defend. The deficiencies in the legal status of the combatants in the Ulster Troubles were known since it began. The failure to find an end to the arguments when peace came was rank moral cowardice that is all too typical of Westminster. In 2023 the Conservatives belatedly introduced the Northern Ireland Legacy Act to prevent this sort of prosecution. The Labour Party had a manifesto commitment to repealing it, because the High Court has found it in contradiction to, you guessed it, the European Convention on Human Rights. Hilary Benn is proud of introducing the repeal Act. When pressed, the likes of Hilary Benn, Lord Hermer and Two-Tier will say that the UK leaving the ECHR would send the wrong message to the world and damage our international reputation. They may go on and say that countries such as Russia aren't members of the ECHR. That entire argument is complete cant; it may make sense in a university debating club or over a glass of champagne in Matrix Chambers, but it's balderdash. The list of countries not in the ECHR incudes the United States, Singapore, Australia, New Zealand and Dubai – all top destinations for the wealthy fleeing the UK. One of the most powerful messages we sent to the world was when the SAS emerged from obscurity and ended the Iranian Embassy siege. The message was stark and clear: 'The UK does not deal with terrorists.' International terrorists steered well clear of us and the UK became a valued partner in the establishment of SAS-like units across the world. (Note for Lammy: that's what soft power looks like. It comes from having a very competent military that is feared globally, so other nations tread softly round your interests and seek your friendship.) Now Labour wants to throw the SAS and the rest of the armed forces under a legal bus rather than support them. It was a manifesto commitment in the general election, but no one noticed or kicked up about it. Why? Because the self-appointed intellectual elite wanted a left-wing coup – which is what the election turned out to be. Ex-KGB man President Putin must have laughed when the left triumphed here in July. Sixty years of sedition (the KGB's speciality) may have paid off too late for the Soviet Union, but it's done for us.",The Conservative Woman,2025-02-17T01:18:00+00:00,Democracy in Decay,Polarised,https://www.conservativewoman.co.uk/throwing-the-sas-under-a-bus/
2,Monkeypox mania continues to take its toll on common sense,"WARS are so inconvenient, especially when they hamper the efforts of the public health lobby to eradicate the latest killer disease. Recently in our favourite source of manufactured anxiety about all things pandemicky,Global Health Now, there was a link to an article inThe East Africanunder the heading 'Conflict in the eastern DRC [Democratic Republic of the Congo] hampers fight against mpox'. The source of the news report was the XinHua News Agency, the press outlet of the Chinese Communist Party. Where accurate reporting on viruses is concerned, surely nothing can be read into that. Proving that monkeypox continues to be an epidemiological nonentity, as we have been at pains to point out in these pages for several years, for examplehere, the report says that this year alone in Africa diagnoses with monkeypox have surpassed 9,959 (so is that 9,960?) and deaths total 85. That puts the infection fatality rate of monkeypox at less than 1 per cent and an infection rate across Sub Saharan Africa (population 1,243,107,741) at 0.000081 per cent. Even if the figures applied only to the DRC (population 102.3million) the infection rate is 0.0097 per cent. Juxtapose that with the fact that fighting in the Congo has alreadyclaimed 700 livesthis year. Of course, death from an infectious virus is no laughing matter and monkeypox looks like a nasty infection to contract. However, in the developed world it is almost totally confined to gay men. In fact in the DCR, where most cases are reported, over 83% of monkeypox casesare linked to sex workand disproportionately affects children and pregnant women. It should also be considered, in relation to figures related to monkeypox from the DRC, that malnutrition is endemic there which puts people at greater risk of succumbing. According to theGlobal Hunger Index, the DRC scores fifth with Somalia in first place.Pregnant women and children are at most risk of malnutritionin the DRC. Fit and healthy people experience mild symptoms with little to fear from monkeypox. Even our own NHS, no stranger to whipping us into a frenzy about viral infections as with the recent 'quad-demic' which seems to have left most of us standing,says that monkeypox'is usually mild and can get better within a few weeks without treatment'. People in the DRC need food yet the WHO response to monkeypox is to push vaccines which, they say, are 'at the heart' of their response. That presumably includes monkeypox vaccines for babies which are being rolled outwithout clinical trials. After all, what could possibly go wrong? GAVI, otherwise known asThe Vaccine Alliance, are responsible for the distribution of vaccines in the DRC. They get their funding from many sources and they are closely linked to the Bill & Melinda Gates Foundation which recently, along with the European Union, hosted a 'high level pledging summit' seeking funding from governments and 'Billanthropists' for the work of GAVI. GAVI, which hasreceived over $4.1billionfrom the Bill & Melinda Gates Foundation provides, along with the Bill & Melinda Gates Foundation,18 per cent of the funding of the World Health Organization. The Foundation has a permanent seat on the board of GAVI. Clearly, such funding does not come without a few chains attached. According to the Gates Foundation,vaccines have become their biggest investment, raising the question of whether Gates expects a financial return on his investment. With79 per cent of the Gates Foundation stock holdings being in biotech companies, which manufacture vaccines, the Foundation has a vested interest in pushing vaccines worldwide. They did this effectively during the covid fiasco. Perhaps Bill Gates has enough money and is simply addicted to the power that vast wealth and philanthropy can buy. One thing is for sure, as explained by Daniel Jupp in his bookGates of Hell: despite his self-proclaimed efforts to give away his fortune, Bill Gates miraculously seems to get no poorer. Whatever happens in the world, people like Bill Gates continue to profit from the cycle of hysteria induced over non-existent pandemics: Gates invests in vaccine manufacture; Gates funds GAVI and the WHO; the WHOapproves and pushes vaccines; GAVI ensures that the vaccines are distributed and governments buy them; vaccine companies make money; Gates (probably) earns a return on his investment; Gates invests in vaccine manufacture . . . repeat. Monkeypox is simply the latest vector for vaccine-derived profit and the poor people of the DRC are simply the latest victims of the scam. I repeat: the people of the DRC need food, not vaccines.",The Conservative Woman,2025-02-14T01:17:00+00:00,News,Polarised,https://www.conservativewoman.co.uk/monkeypox-mania-continues-to-take-its-toll-on-common-sense/
3,When Panama defeated Trump’s forebears,"WHETHER Donald Trump carries out his threat that the US willtake back control of the Panama Canal– a warning he reinforced in his inauguration speech on January 20 – remains to be seen. Being partially of Scottish descent, perhaps he should be wary of venturing there; 326 years ago, the Isthmus of Panama saw one of the most woeful chapters in Scottish history unfold: the Darien Scheme. The ill-fated project was Scotland's attempt to break away from England's economic dominance and establish its own overseas empire. A company of investors was formed to found a colony, which would be named Caledonia, near the Gulf of Darien on the Caribbean/Atlantic Coast of the isthmus, about 200 miles east of today's Panama Canal. The ambitious aim was to open a land link to the Pacific Ocean, around 50 miles away. The investors envisaged establishing a thriving trade route across the isthmus, with a town to be called New Edinburgh as the entrepôt where tariffs would be charged. Gold mines and fertile plantations would add to the riches, and the wealth would flow back to Scotland. Instead, the dream died a squalid death in the swampy, rain-lashed, disease-ridden jungles of the isthmus, along with around 2,000 would-be colonists. There are several excellent books about Darien. One of the most accessible and comprehensive accounts is John Prebble'sDarien: The Scottish Dream of Empire(first published in 1968 asThe Darien Disaster). Prebble explains how the scheme came about at the end of the 17th century, amid the Scottish establishment's simmering resentment of England. Although both nations had shared a monarch since 1603, Scotland, which still had its own parliament and laws, was very much the poor relation. It languished under England's Navigation Acts, which crippled its seaborne commerce, whilst bad harvests and drought brought famine. The country was in an ever-deepening slough of despond. The initial driving force behind the Darien Scheme was William Paterson, a wealthy Scottish-born financier and trader who co-founded the Bank of England in 1694. He helped to form the Company of Scotland Trading to Africa and the Indies to promote the project. However, its attempts to set up a base in London were thwarted by the East India Company and garnered the disapproval of the King, William III. William did not want to risk war with Spain, which claimed sovereignty over the Isthmus of Panama as part of its lucrative empire in the New World. In response, the Scottish company retreated to Edinburgh and made a public appeal for investors. The response was astonishing and, within weeks, around 1,500 Scots had raised £400,000, which was around 20 per cent of the wealth available in Scotland and equivalent to some £66million today. On July 14, 1698, five ships, carrying 1,200 pioneer colonists, set sail from Leith. The expedition was shambolic from the start. Beset by storms, fog and becalming, the fleet did not reach the Orkneys until early August. When it finally headed west into the Atlantic, there was a snag – no one knew exactly where Darien was. It was not until they reached the West Indies that they met up with an old buccaneer who agreed to guide them. Meanwhile, deaths began mounting from the bloody flux and fever. The fleet reached Darien on November 2 and, at first, it looked like the paradise they had envisaged. From an anchorage in a sheltered bay, the colonists went ashore and started clearing land for growing crops and building New Edinburgh in the form of wooden huts. However, planting proved difficult, and the tropical heat and incessant heavy rainfall made conditions intolerable (Darien is one of the wettest places on Earth). The local native Kuna (or Guna) tribes were friendly and brought gifts of plantains and fruit, but they were unwilling to buy the combs, mirrors and other trinkets the colonists had brought as trade goods. Meanwhile, King William had ordered the English colonies in North America and the Caribbean not to supply the Scots or give them help of any sort. Ships of other nations which occasionally anchored at Darien would only trade food and other supplies for gold, which the colonists did not have. To make matters worse, the leading colonists soon split into squabbling factions, the main division being between seamen and landsmen. The sailors lived aboard the ships in relative comfort and safety while the rest faced the brutal environment of the rainforest. Eventually, food began running out, as supplies that had been stored aboard the ships for months were rancid. Rationing was imposed, with colonists each getting just one pound of maggot-infested flour per week, but only those who could work were fed. As tropical fevers and malaria struck the starving hordes, deaths continued to mount; by Christmas Day, the total was 76 and, by March, there were 200 graves, with ten or 12 deaths per day. There was a brief moment of elation when Spanish troops were repelled in a jungle skirmish, but the fear of a stronger attack and the deteriorating conditions led to the colony being abandoned on June 18, 1699, leaving behind a cemetery with 400 graves. At least another 400 colonists died trying to get back to Scotland. Only one ship, theCaledonia, made it, arriving in November with just 300 men aboard. A second fleet of four ships, not knowing of the abandonment, had already been sent out to Darien three months earlier, carrying 1,300 people. Before they arrived on November 30, 160 had died en route. Expecting to be greeted by a thriving colony, one of the new arrivals wrote: 'We found nothing but a vast howling wilderness.' After coming under siege from the Spaniards, the colonists agreed to leave Darien, and it was finally abandoned on April 11, 1700. None of the ships returned to Scotland, with hundreds of those aboard dying en route to Jamaica, where they were refused help, or to the North American colonies. In all, around 1,000 are thought to have perished through sickness and shipwreck. Scotland's vision of a bounteous empire had turned to dust, and the recriminations were rancorous. The returned colonists were reviled as cowards, riots ensued, and those who had sunk their money into the scheme angrily counted the cost. Scapegoats were sought and in a scandalous act of vengeance, three innocent English sailors were hanged in Leith on a fabricated charge of piracy in April 1705. The collapse of the Darien Scheme is said to have helped push Scotland into signing the Acts of Union with England in 1707. One provision of the treaty was that Scotland would be paid a sum called 'The Equivalent' (£398,085) which would essentially repay the Darien investors their capital, plus five per cent interest. Today, New Edinburgh has been reclaimed by the jungle and the mangrove swamps; all that remains of it is a shallow defensive ditch dug by the colonists. There was some morsel of memory in that the bay where the fleet first arrived was known as Puerto Escocés (Scottish Harbour), but that has been renamed Sukunya Inabaginya in honour of a Panamanian general. However, all is not lost. In the nearby San Blas archipelago, the inhabitants have renamed theirsmall island Caledonia, in honour of the Scots who had arrived with such high hopes and left amid such heartbreaking failure.",The Conservative Woman,2025-02-01T01:19:00+00:00,News,Polarised,https://www.conservativewoman.co.uk/when-panama-defeated-trumps-forebears/
4,"Bravo, JD Vance, speaking up for real Europeans","We make no apologies for publishing a further response to JD Vance's seminal, possibly epoch-changing speech to EU leaders in Munich last Friday. On Sunday Dr Frederick Attenboroughrelayed the Vice President'sdevastating critique of the systematic suppression of dissent by Britain and its Europeanallies. YesterdaySean Walsh analysed Vance's 'massacre'of the Euro commissars and their British counterparts. Today, from the USA, we have a vote of thanks from British-born Bernard Carpenter along with his excoriation of the outraged legacy media's response. THANK you, thank you, thank you so much, Vice President Vance, for expressing what so many ordinary Europeans feel. Your recent speech at the Munich Security Conference might have upset a few elitists but gave great pleasure to those who love free speech and freedom of conscience on both sides of the Atlantic and offered hope to those feeling crushed by open borders and the social pathologies to which mass immigration gives rise. Thank you, too, for giving permission to those who lament the demise of their national cultures and the Western/Christian civilisation which once gave a sense of common heritage and belonging to a continent that created the modern world. The horror on the faces of those gathered to hear the speech was a sight for sore eyes, a delicious pleasure for those of us who are supposed to keep our mouths shut and go along with all the woke nonsense for fear of being labelled as one of the myriad terms ending in 'ist'. How lovely it was to see those smug, holier-than-thou, highly credentialled, gilt-edged globalist aristocrats looking so outraged. How dare a man who grew up in poverty in the Appalachian Rust Belt have the temerity to lecture so august a body of technocrats on subjects such as the freedom of citizens to criticise their lords and masters and reject the godless ideologies they foist on those they govern? But dare he did and many here in America and in Europe are loving him for it. Needless to say, not all welcomed his speech. The legacy media reacted as you would expect.'Vance shocks Europe', blasted America's newspaper of record, theNew York Times, accusing him of defending 'a divisive far-right political party in Germany', presumably AfD (Alternative for Germany), although Vance did not mention it by name. Incidentally, which Europe is Vance accused of shocking – anti-mass-immigration Eastern Europe, the Europe of men and women struggling to make a living, fearful to go out at night due to high rates of crime, people who fear being arrested for posting on social media or expressing their Christian faith, Europeans who resent the rewriting of their histories and are deeply troubled by the rubbishing of their distinctive cultures, Jews compelled to hide their identity on the streets of European capitals due to rising rates of anti-Semitism? Surely none of those just listed. It goes without saying that it has in mind the Europe of technocratic elites who are now determining the fate of a once great civilisation according to their godless visions of utilitarian dystopias 'made more sinister', to quote Churchill, 'and perhaps more protracted, by the lights of perverted science'. Elsewhere in the paper, Vance is accused of 'attacking a German consensus on Nazis and speech . . . decades-long approaches to political extremism that were designed to prevent another Hitler'. CNN seeks to indict him for dealing in 'half-truths' and trying to reignite popularism in Europe, as if it needs any help from American politicians, but seriously undermines its journalistic authority when it describes the electoral disaster that the British Conservative Party suffered last year as proof that populism is past its peak in a Europe which is 'a little wiser now' than it was during Trump's first administration. If the Conservative Party is an example of populism, then I'm in line to become the next living prophet of the Church of Jesus Christ of Latter-day Saints. When it comes to the legacy media and their negative reactions to Vance's Munich speech, I could go on and on. It seems to have offended all the right people, theGuardianleading with the hysterical headline:'JD Vance stuns Munich conference with blistering attack on Europe's leaders'. The BBC even used the adjective'weird'to attack the speech, dusting off a word which was used to attack Trump and Vance last summer. It didn't work then, and it's unlikely to work now. The BBC at least got it right when it said that Vance'accused European governments of retreating from their values, and ignoring voter concerns on migration and free speech', but failed to consider the veracity of his accusation. Before I finish, allow me to list some of my thoughts on the now-infamous speech and why I and many others loved it. I loved it because Vance mentioned specific examples of how ordinary men and women have been negatively affected by their governments' authoritarian tendencies, people who have fallen foul of hate-speech laws that should have no place in a democratic society. Of course, such a list could become very long indeed, as we've seen over the last 20 years or so. But by naming people such as Adam Smith-Connor, found guilty of praying outside an abortion clinic in my home town of Bournemouth, and alluding to other examples of those prosecuted for exercising their God-given rights to express themselves freely, Vance has put a face on this Orwellian evil. Above all, I loved his speech because he articulated what millions of Europeans are thinking but are too cowed by the increasingly authoritarian administrative state to express themselves for fear of the very real possibility of punitive consequences. And, my dear elites, stop using fear of another Hitler as an excuse to stifle free speech. Arresting people who criticise feminism or tell tasteless jokes will do nothing to prevent the return of fascism, but unrestricted immigration and draconian restrictions on free speech will drive decent men and women into the arms of those on the supposedly 'far right' that the left claims to be the greatest existing threat to liberal democracy at a time when Europe is importing millions of immigrants who despise everything liberal democracy claims to stand for. The spell has been broken, and we must be grateful to Vice President Vance for making that clear to those assembled in Munich to hear him speak so plainly and so forcefully.",The Conservative Woman,2025-02-19T01:19:00+00:00,Editor's Pick,Polarised,https://www.conservativewoman.co.uk/bravo-jd-vance-speaking-up-for-real-europeans/


In [16]:
len(tcw_data_df)

100

**The Canary**

Articles have been scraped from the UK section of The Canary (https://www.thecanary.co/uk/) from newest to oldest. Article date range is January 7th to January 29th 2025. Five articles were excluded for not meeting the labelling criteria (articles focused on getting users to sign a petition, advertorials.)

In [11]:
def scrape_can_article(url):
    """
    Scrapes an article from a given URL on https://www.thecanary.co/uk/ and extracts relevant information.
    
    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Polarised",
        "url": url
    }

    try:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            # Remove tweet embeds
            for twitter_blockquote in soup.find_all('blockquote', class_='twitter-tweet'):
                twitter_blockquote.decompose()
            # Remove ad elements
            for ads_div in soup.find_all('div', class_='ads_google_ads'):
                ads_div.decompose()

            # Title
            title_meta = soup.find('meta', property='og:title')
            article_data["title"] = title_meta['content'] if title_meta else "Title not found"
            
            # URL
            url_meta = soup.find('meta', property='og:url')
            article_data["url"] = url_meta['content'] if url_meta else url
            
            # Site name
            site_name_meta = soup.find('meta', property='og:site_name')
            article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
            
            # Published date
            published_date_meta = soup.find('meta', property='article:published_time')
            article_data["date"] = published_date_meta['content'] if published_date_meta else "Published date not found"
            
            # Category
            category_found = None
            yoast_script = soup.find('script', class_='yoast-schema-graph', type='application/ld+json')
            if yoast_script:
                try:
                    yoast_data = json.loads(yoast_script.string)
                    for item in yoast_data.get('@graph', []):
                        if item.get('@type') == 'NewsArticle':
                            section = item.get('articleSection')
                            if section:
                                if isinstance(section, list) and len(section) > 0:
                                    category_found = section[0].strip()
                                elif isinstance(section, str):
                                    category_found = section.strip()
                                break
                except json.JSONDecodeError:
                    pass
            # If we never found a category, use a default
            if category_found:
                article_data["category"] = category_found
            else:
                article_data["category"] = "Category not found"
            
            # Article copy
            article_body = soup.find('div', class_='jeg_inner_content')
            featured_image_patterns = [
                re.compile(r'^Featured image via .*$', re.IGNORECASE),
                re.compile(r'^Featured image supplied', re.IGNORECASE),
                re.compile(r'^Featured image and additional images via .*$', re.IGNORECASE),
                re.compile(r'^Featured image and additional images supplied$', re.IGNORECASE)
            ]
            if article_body:
                paragraphs = article_body.find_all('p')
                text_content = []
                
                for p in paragraphs:
                    if any(pattern.match(p.text.strip()) for pattern in featured_image_patterns):
                        p.decompose()
                    p_text = p.get_text().strip()
                    if p_text:
                        text_content.append(p_text)
                
                full_article = " ".join(text_content) if text_content else "Article content not found"
                article_data["text"] = clean_text(full_article)
            else:
                article_data["text"] = "Article content not found"
        
        else:
            print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")

    except Exception as e:
        print(f"An error occurred: {e}")
    return article_data


# List of URLs to scrape
urls = get_urls_from_txt("canary.txt")
# Scrape articles and create a DataFrame
can_data_df = scrape_multiple_articles(urls, scrape_can_article)
# Store to CSV
can_data_df.to_csv("polarised_scraped_articles_can.csv", index=False)
# Print head 
can_data_df.head()

Unnamed: 0,title,text,site,date,category,class,url
0,Leicester university hunger strikers for Palestine reach two-week mark,"Hunger strikers at Leicester university have reached the two-week mark in their protest in solidarity with the Palestinian people. On Wednesday 15 January, five University of Leicester students went on hunger strike ""over the university's complicity [Israel's] in genocide"". Leicester Action for Palestine said this followed ""severe repression from the University, who had 11 people arrested in November for allegedly occupying the Attenborough tower"". An open letter in solidarity with the hunger strikers and arrestees explains that students want the university to ""cut ties with Barclays, arms companies and other companies targeted by BDS"". After ten days of hunger striking, students would require hospitalisation as a result of their protest. For this reason, Leicester Action for Palestine revealed at the weekend, three of the original five students had stopped their strike at that point. The group also described how the university has sent ""lots of information about how dangerous a prolonged hunger strike can be"", but has nonetheless ""chosen to delay our next negotiation meeting until the end of next the week"". This means ""forcing the strikers to extend their strike unnecessarily"". The students are ""risking their health and education"" in order to push the university to ""sever ties with companies complicit in the slaughter of thousands across the Middle East"". The university's inaction so far, Leicester Action for Palestine asserted, suggests the institution doesn't truly care about its students beyond a ""facade of wellbeing slogans and posturing"". If you want to support the students, you can share their story and their demands below. You can also sign their open letter in solidarity, and email the university registrar at [email protected].",Canary,2025-01-29T16:18:12+00:00,News,Polarised,https://www.thecanary.co/uk/news/2025/01/29/leicester-university-students-hunger-strike/
1,"Palestine Action hit FIFTEEN (yes, 15) offices of Allianz in one night","In a coordinated wave of actions across Europe, Palestine Action struck at 15 premises of the 'Allianz' company, investors in and insurers of Elbit Systems, Israel's largest weapons company. Allianz's provision of Employers Liability Insurance to Elbit Systems UK renders the insurance giant deeply complicit in the genocide in Gaza, as – without insurance – Elbit could not operate in Britain. German firm Allianz, which is also a major institutional investor in Elbit, had nine of its premises struck with red paint and smashed windows on Tuesday 28 January. The actions happened across England, Scotland, and Ireland. Across Europe, actions took place against Allianz sites in Germany, Netherlands, and Portugal. This was London: Allianz suffered a similar fate in Birmingham: Glasgow got hit: As did Newcastle: The direct action campaign against Allianz launched last October, when Palestine Action visited ten branches of Allianz overnight, covering them in graffiti, and symbolic blood-red paint, in addition to occupying their offices in Guildford. As Palestine Action said then: These nationwide actions serve as a reminder that, throughout the past twelve months, Western capital has continued to profit from the mass murder of Palestinians. Then, as now, Palestine Action has called upon Allianz to drop Elbit as a client now and in the future, and to divest entirely from shareholdings in Elbit Systems. Having failed to confirm that this action was taken, further visits were made to Allianz offices in Belfast and Glasgow. Since then, Israel has continued to murder Palestinians, and others throughout the region, using weapons developed and made in Britain, by 'Elbit Systems UK' – despite ceasefires. While Elbit weaponry continues to slaughter Palestinians, the campaign will continue against the companies, including Allianz, which facilitates Elbit's weapons production. The insurance services rendered to Elbit are crucial for the maintenance of their operations, while Allianz has previously been described as Elbit's ""principle institutional shareholder"", at-one-point owning over 2% of the company, and to-this-day continuing to hold thousands of shares in Elbit Systems Ltd. A spokesperson for Palestine Action said: Palestine Action's campaign to remove Israel's weapons trade from Britain will continue to use the necessary means, direct action, against Elbit and the companies that aid, abet, and profit from the slaughter of Palestinians. As Allianz prioritises the value of profits over the lives of Palestinians, we will ensure they lose more than they gain by working with the Israeli weapons maker.",Canary,2025-01-29T16:50:37+00:00,News,Polarised,https://www.thecanary.co/uk/news/2025/01/29/palestine-action-allianz-israel/
2,Met Police already threatening peaceful protesters at the 'Lord Walney' 16 appeal hearing,"16 Just Stop Oil supporters are appealing their draconian sentences at the Court of Appeal today and tomorrow. The mass appeal concerns 16 political prisoners with combined sentences of 41 years handed down between July and September 2024. They are known as the Lord Walney 16. On Thursday, the second day of the hearing, at noon, the campaign group Defend Our Juries will stage a lawful and peaceful protest outside the Royal Courts of Justice to highlight the wrongful silencing and jailing of the political opponents of the arms and oil industries. A spokesperson for the Met Police said: We're aware of plans for a protest outside the Royal Courts of Justice. Officers will be deployed in the area to ensure any incidents are swiftly dealt with. People gathered on the first day of the appeal: All 16 Just Stop Oil supporters were jailed in the months following the publication of a report to the government written by a paid lobbyist for the oil and arms industry that called for groups such as Just Stop Oil and Palestine Action to be banned in a similar way to terrorist organisations. John Woodcock, formerly a Labour MP and now a crossbench peer, published a report on 'political violence' in May 2024, which was falsely presented to the public and Parliament as 'independent'. The appeal concerns the four and five year prison sentences handed down in July 2024 to Just Stop Oil co-founder Roger Hallam and four co-defendants for taking part in a Zoom call to discuss a planned protest on the M25. It also includes the 20 month sentence imposed on 78 year old grandmother Gaie Delap and four co-defendants who carried out the M25 gantry protest in November 2022, as well as the 18 month to 3 year sentences imposed on four supporters who dug and occupied tunnels under a road leading to the Navigator Oil Terminal in Essex in August 2022. Also under review are the sentences imposed by Judge Hehir on the ""soup throwers"" Phoebe Plummer and Anna Holland. They were sentenced to two years and twenty months respectively after they threw soup at the glass covering Van Gogh's Sunflowers in October 2022, leaving the painting unharmed. A Just Stop Oil spokesperson said: Our broken political system is on trial today. This case is not about whether peaceful climate defenders deserve to be punished with long prison sentences. It is about whether it is acceptable in a democracy to allow wealthy fossil fuel executives to dictate our laws, pervert our criminal justice system and silence all opposition to their destructive and harmful business. Just Stop Oil supporters in prison are political prisoners. They are not there because they disrupted or harmed everyday people – if that were the case, the water company bosses, Post Office execs and those responsible for the Grenfell disaster would be behind bars. No, they are there because Just Stop Oil threatens the profits of the fossil fuel industry. We say to the government you can lock us up but more people will take our place as the extreme consequences of climate breakdown become more apparent. We will not be deterred, we are more afraid of the collapse of ordered civil society than arrest and imprisonment. We are acting in self-defence and to protect our families and communities. We refuse to be bystanders to the ultimate crime against humanity and life on earth. The appeal will be heard by the most senior Judge in England and Wales, Lady Chief Justice Baroness Carr, Mr Justice Lavender and Mr Justice Griffiths. The 16 are represented by Danny Friedman KC and Brenda Campbell KC. Greenpeace and Friends of the Earth are supporting the group with written submissions to the courts and up to 1,000 supporters are expected to show solidarity by gathering outside the court tomorrow. The outcome of the appeal could be decided any time from a few days to eight weeks following the hearing. At the trial of the 'Whole Truth Five', Judge Hehir ruled that climate issues were 'irrelevant and inadmissible', dismissing them as mere 'political opinion and belief. He handed down a sentence of five years to Just Stop Oil co-founder Roger Hallam, while Daniel Shaw, Lucia Whittaker De Abreu , Louise Lancaster and Cressida Gethin were each sentenced to four years. Michel Forst, UN special rapporteur has previously described these sentences as 'not acceptable in a democracy'. Also among the group of 16 is Gaie Delap, the 78-year-old grandmother, who was recalled to prison just before Christmas, because SERCO failed to provide a tag that could fit a woman's wrist. Gaie was sentenced, alongside four co-defendants, to twenty months imprisonment in August 2024 for her part in an action on the M25 in November 2022. Four (including Gaie) were released early, three of whom have been successfully tagged. On day one, Wednesday 29 January, BBC News reported that: Danny Friedman KC, one of several lawyers representing the group, said some of the sentences were ""the highest of their kind in modern British history"". He said that if the terms were allowed to stand, it would mark a ""paradigm shift in this area of criminal law sentencing"". The court was told the protesters ""acted in the knowledge that they would be prosecuted"", but ""none of the applicants acted out of self-interest"". Mr Friedman said: ""What these applicants did by way of collective, non-violent protest, whether one likes it or not, was for the interests of the public, of the planet, and of future generations."" ""They did what they did out of sacrifice,"" he added. The Labour government continues to present Lord Walney to the public as 'independent', encouraging judges to act on his recommendations. Despite a previous commitment by the Home Office to review his role by the end of October 2024, Lord Walney remains in position. Prior to Lord Walney's report in May 2024, jail sentences for peaceful protest in Britain were unusual.",Canary,2025-01-29T15:55:00+00:00,News,Polarised,https://www.thecanary.co/uk/news/2025/01/29/just-stop-oil-lord-walney-16-appeal/
3,A scathing new report on child poverty just shamed the Labour Party,"The UK government won't see progress on child poverty by the end of this parliament – even with high economic growth – if investment in social security does not form a part of its child poverty strategy. As the Joseph Rowntree Foundation (JRF) publishes its annual UK Poverty report, new analysis shows that under central OBR projections, only Scotland will see child poverty rates fall by 2029, demonstrating the power of social security policy in tackling poverty. In the central scenario, without further policy changes, by 2029: While the Scottish Government does appear to be making progress, it will remain some way off reaching its child poverty reduction targets without further action. JRF is also cautioning that children mustn't pay the price for the ups and downs in the economy. Any cuts to welfare spending are very likely to pull more families into poverty, as our social security system is already out of step with the costs families are facing. The leading annual barometer of poverty from the JRF finds that in the UK: The child poverty outlook across the four nations is shameful, with only Scotland showing some improvement JRF examined changes in child poverty levels between January 2025 and January 2029 based on different assumptions about the growth of the UK economy. If the UK economy grows in line with the Office for Budget Responsibility's (OBR) forecast over the next 4 years, child poverty rates in Scotland, already lower than the rest of the UK, will fall further by 2029. This results in a difference of nearly 10 percentage points between Scotland and the rest of the UK by 2029, up from 7 percentage points in 2025. A strong economy can increase wages and employment but will not in itself reduce poverty. Even if the UK economy grows significantly more than expected, overall child poverty rates show little change and could even rise if growth benefits higher income households more than lower income ones. Specific, targeted policies are needed if child poverty rates are to come down. JRF analysis shows that none of the 9 English regions are likely to see a fall in child poverty between 2024 and 2029, with 5 regions modelled as having increases over the period and the remaining regions showing no change. In previous years, differences in child poverty rates across the UK nations were driven by lower average housing costs in Scotland and Northern Ireland. However, JRF's latest analysis shows a similar reduction in poverty levels before housing costs are taken into account for children in Scotland compared to the rest of the UK. This strongly suggests that welfare policies, such as the Scottish Child Payment and mitigating the two-child limit from 2026, which boost the incomes of the parents of who receive them, are behind Scotland bucking the trend of rising child poverty rates elsewhere in the UK. The UK Government's child poverty strategy must abolish the two-child limit and introduce a protected minimum amount of support to Universal Credit. Later this year the UK Government will publish an 'ambitious' cross-government child poverty strategy. Any respectable child poverty strategy must include action on social security. Currently, our social security system doesn't reflect the cost of life's essentials as well as the reality that some families have higher costs or need to make one income stretch further, including larger families and lone parent families. These families are disproportionately impacted by specific welfare policies such as the two-child limit and the benefit cap. Along with abolishing the two-child limit, the UK Government must introduce a protected minimum amount of support below Universal Credit's current basic rate. This would restrict the amount that benefit payments can be reduced by the benefit cap. This would also represent a first step towards an Essentials Guarantee in Universal Credit, ensuring that everyone can afford essentials like food and household bills. Paul Kissack, Chief Executive of the Joseph Rowntree Foundation, says: Growing levels of poverty and insecurity are acting as a tightening brake on growth and opportunity. We can't expect children to be ready for school or able to learn if they're going without the basics. Growing up in poverty can also lead to poor health, increasing pressure on the NHS. Child poverty will only be driven down through focused, deliberate and determined policy action. Even very strong economic growth won't automatically change the picture. Policy action must start with the system designed to help people meet their costs of living – social security. At the moment that system is not only failing to do its job but, worse, actively pushing some people into deeper poverty, through cruel limits and caps. The good news is that change – meaningful change to people's lives – is possible and can be achieved quickly. We know this from our recent history, and from different approaches across the UK. The British public believes that everyone should be able to afford the essentials. With its child poverty strategy later this year the Government has the opportunity to show it agrees. Any credible child poverty strategy must include policies that rebuild the tattered social security system. The wellbeing of millions of children depends on that. And so do the Government's wider ambitions for improved living standards and opportunity. JRF's annual UK Poverty report also finds that poverty rates in 2022/23 were broadly flat, remaining at a similar level to before the pandemic [9]: Responding to the report, Cllr Arooj Shah, Chair of the Local Government Association's Children and Young People Board, said: No child should ever grow up in poverty and this report underlines the urgency of the Government's Child Poverty Strategy. As this report confirms, the most effective way to support low-income families and lift them out of poverty is through an adequately resourced national safety net. This needs to be alongside sustainable long-term funding for vital local services provided by councils, such as advice services, local welfare assistance, housing and employment support. We are engaging with the Government on its proposed strategy and working with them to ensure that every child has the best possible start in life.",Canary,2025-01-29T10:29:02+00:00,Analysis,Polarised,https://www.thecanary.co/uk/analysis/2025/01/29/jrf-child-poverty/
4,DWP benefits like PIP are officially GOOD for the economy - despite Labour's propaganda,"Just as the Labour Party government is tabling reforms and cuts to Department for Work and Pensions (DWP) benefits like PIP, a new report by a think tank has exposed that, far from being a drain on public finances, health and disability-related welfare is actually good for both the claimant and society more broadly. For 3.5 million people in the UK, DWP benefits like Personal Independence Payment (PIP) and Disability Living Allowance (DLA) are a lifeline. These programs cover essential extra living costs, such as mobility aids and support for daily activities. Yet, the value of these benefits transcends mere financial support. Think tank Pro Bono Economics has released a new report. It's called More than money: The life-long wellbeing impact of disability benefits. The study reveals that receiving chronic illness or disability benefits increases life satisfaction by an average of 0.79 points on a 10-point scale. This improvement becomes even more pronounced over time, reaching 1.1 points after four years of receiving support. For individuals in poor health, the impact is even greater, with life satisfaction increasing by 1.2 points. This boost reflects not only financial relief but also enhanced mental health, social inclusion, and reduced anxiety​. The monetary value of this wellbeing improvement is staggering. According to the study, the annual wellbeing benefit for each recipient of DWP PIP or DLA is valued at £12,300. This translates to an estimated £42 billion in total annual benefits for all recipients, far outweighing the £28 billion cost of providing these benefits. For every £1 spent on disability support, the UK economy gains £1.48 in wellbeing benefits​. Cutting these benefits would not only diminish the quality of life for millions but also negate these substantial economic gains. The government must recognise that disability benefits are an investment in public health and societal wellbeing, not a drain on public resources. Behind the DWP PIP statistics are real people whose lives have been transformed by disability benefits. Take Peter, for example, a single Autistic man in his fifties who also lived with communication difficulties. Before receiving PIP, Peter relied on food banks and struggled with severe mental health challenges due to financial stress. The award of PIP doubled his income, enabling him to pay off debts, improve his mental health, and escape the cycle of poverty. Peter described the support as giving him ""a new life"". Similarly, Anatoli and Agnes, a refugee couple, faced insurmountable financial difficulties due to Anatoli's disabilities. Their approval for PIP provided the financial stability needed to access healthcare and rebuild their lives in the UK. Stories like these highlight the irreplaceable role of disability benefits in fostering dignity and independence​. Despite their proven benefits, not all eligible individuals receive disability support. Complex application processes, lack of awareness, and fear of rejection deter many from claiming their entitlements. For those who attempt to navigate the system, the process can exacerbate mental health challenges and anxiety​. In 2023/24, nearly 37% of DWP PIP awards were granted for mental health conditions, reflecting the growing need for support in this area. Yet, proposals to tighten eligibility criteria or add further barriers could prevent many vulnerable individuals from accessing the assistance they desperately need​. The government's rationale for reducing disability benefits may stem from a desire to control public spending. However, the evidence suggests that such cuts would be a false economy. The wellbeing benefits of disability support via DWP PIP not only enhance recipients' lives but also generate significant economic returns. Furthermore, reducing benefits would likely increase demand on other public services, such as healthcare and social care, exacerbating existing pressures on these systems. Improving, rather than restricting, access to disability benefits is the way forward. Simplifying application processes, raising awareness about eligibility, and collaborating with charities can ensure that support reaches those who need it most. Organisations like Z2K already play a vital role in helping disabled individuals navigate the system, but their efforts must be complemented by systemic reforms​. As the government considers reforms to the disability benefit system, it must prioritise the wellbeing of disabled people. Maintaining—and even expanding—current levels of support is not only a moral imperative but also an economic necessity. Any reforms should focus on reducing barriers and improving access, ensuring that those eligible can claim their entitlements without undue hardship. Disability benefits like DWP PIP are not a luxury; they are a lifeline for millions of people facing systemic inequalities and additional living costs. Cutting these benefits would harm some of the most vulnerable members of society and undermine the economic and social fabric of the nation. The evidence is clear: investing in disability benefits is investing in a healthier, happier, and more inclusive society.",Canary,2025-01-29T12:24:35+00:00,Analysis,Polarised,https://www.thecanary.co/uk/analysis/2025/01/29/dwp-pip-health-benefits/


In [12]:
len(can_data_df)

100

**Breitbart**

Articles have been scrapped from the News section in reverse chronological order: https://www.breitbart.com/news/source/breitbart-news/ Articles with a category of "clips" and "radio" were excluded as they are media content. Article range is January 17th to 20th 2025.

In [13]:
def scrape_bb_article(url):
    """
    Scrapes an article from a given URL on https://www.breitbart.com and extracts relevant information.
    
    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Polarised",
        "url": url
    }

    try:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')

            # Title
            title_meta = soup.find('meta', property='og:title')
            article_data["title"] = title_meta['content'] if title_meta else "Title not found"
            
            # URL
            url_meta = soup.find('meta', property='og:url')
            article_data["url"] = url_meta['content'] if url_meta else url
            
            # Site name
            site_name_meta = soup.find('meta', property='og:site_name')
            article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
            
            # Published date
            published_date_meta = soup.find('meta', property='article:published_time')
            article_data["date"] = published_date_meta['content'] if published_date_meta else "Published date not found"
            
            # Category
            cat_meta = soup.find('meta', property='article:categories')
            if cat_meta and cat_meta.get('content'):
                article_data["category"] = cat_meta['content'].split(',')[0]
            else:
                article_data["category"] = "No category found"

            # Article copy
            main_content = soup.find('div', class_='entry-content')
            if main_content:
                # Remove tweets
                tweet_iframes = main_content.find_all('iframe', class_='bnn-if-tweet')
                for tw in tweet_iframes:
                    tw.decompose()
                # Remove images and captions
                image_captions = main_content.find_all("div", class_="wp-caption aligncenter")
                for div in image_captions:
                    div.decompose()
                # Remove reporter promo paragraph
                follow_pattern = re.compile(
                    r'(?i)\bfollow\b.*?(facebook|twitter|instagram|truth\s*social|x|@[a-z0-9_.-]+|email)',
                    re.IGNORECASE
                )
                all_paras = main_content.find_all("p")
                for p in all_paras:
                    para_text = p.get_text(strip=True)
                    if follow_pattern.search(para_text):
                        p.decompose()
                    elif "reporter for Breitbart News" in para_text:
                        p.decompose()
                    elif "Breitbart News Daily airs on SiriusXM" in para_text:
                        p.decompose()
                    elif "Order your copy today" in para_text:
                        p.decompose()

                # Extract text
                raw_text = main_content.get_text(separator=" ", strip=True)

                article_data["text"] = clean_text(raw_text)
            else:
                article_data["text"] = "Article body not found"
        else:
            print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")

    except Exception as e:
        print(f"An error occurred: {e}")

    return article_data

# List of URLs to scrape
urls = get_urls_from_txt("breitbart.txt")
# Scrape articles and create a DataFrame
bb_data_df = scrape_multiple_articles(urls, scrape_bb_article)
# Store to CSV
bb_data_df.to_csv("polarised_scraped_articles_bb.csv", index=False)
# Print head 
bb_data_df.head()

Unnamed: 0,title,text,site,date,category,class,url
0,ICC Prosecutor Leading Charge Against Israel Meets Syria's Jihadi Overlords,"International Criminal Court (ICC) prosecutor Karim Khan visited Damascus, Syria, this weekend to meet with jihadi warlord Ahmed al-Sharaa, the de factor leader of the country after the fall of the Assad family regime. In a message the ICC published on social media, the world court said British lawyer Khan expressed gratitude to ""Syrian authorities"" for ""open & constructive discussions"" regarding holding war criminals and others accountable following the resolution of the Syrian Civil War. Syria endured over a decade of civil war under deposed dictator Bashar Assad that evolved into a melee featuring both fighting between the Assad regime and several opposition militias and a host of terrorist, separatist, and state actors fighting each other in Syria for a variety of reasons. The context of the Syrian civil war allowed the Islamic State to carve out land for a ""caliphate"" in the northern region of Raqqa that was ultimately eradicated through collaboration between the United States and the Syrian Democratic Forces (SDF), a coalition of Kurdish-led militias that largely avoided fighting for or against Assad. The war ended in early December when Assad fled the country for Russia. Ahmed al-Sharaa, formerly known by his jihadist name Abu Mohammed al-Jolani, became the de facto leader of the country as the head of the al-Qaeda offshoot militia Hayat Tahrir al-Sham (HTS). HTS launched a surprise assault of Assad forces in late November in Aleppo, Syria's second-largest city, that sent Assad forces fleeing. The striking success of HTS in Aleppo led to successive captures of territory from Idlib to Damascus; the militia's arrival to the capital prompted Assad to flee. Human rights groups and the United Nations have documented widespread evidence that Assad and several other actors in the civil war committed war crimes, crimes against humanity, and other atrocities. The ICC is an international court with jurisdiction to prosecute individuals for three types of crimes: genocide, war crimes, and crimes against humanity. Khan's visit to Syria was reportedly intended to begin the process of formal investigations potentially leading to ICC convictions. Reuters reported that Sharaa's nascent regime invited Khan to discuss war crimes. Khan proclaimed himself pleased with conversations with Sharaa on the possibility of international justice for Syrian civil war crimes. ""Some of the remarks coming out of Syria by the transitional government seem to have indicated an openness to justice and accountability for crimes that may have taken place,"" Reuters quoted Khan as saying. ""I think we're happy to take part in the conversation to tell them the options that they have."" The visit was reportedly a "" surprise "" stop for Khan and the ICC did not offer any specific steps forward for its participation in Syrian justice. Syria is not a signatory to the Rome Statute, which established the ICC, so it does not have to accept ICC jurisdiction. The ICC statements and quotes from Khan did not indicate that he discussed in any depth with Sharaa the crimes that HTS terrorists may have committed themselves during the decade-plus of its existence, or what the new Syrian regime would do to defend the human rights of its beleaguered civilians. HTS is a U.S.-designated terrorist organization that sprang out of al-Qaeda. American authorities were offering a $10 million bounty for Sharaa himself, as the leader of the jihadists, until former President Joe Biden rescinded the reward in December. Sharaa, now wearing Western-style suits instead of military fatigues, has offered vague public statements asserting that he would lead an ""inclusive"" government and respect the existence of religious and ethnic minorities in the country, but also affirmed that the government replacing Assad would be Islamist. ""We take pride in our culture, our religion and our Islam. Being part of the Islamic environment does not mean the exclusion of other sects. On the contrary, it is our duty to protect them,"" Sharaa said in an interview in December. Prior to the HTS takeover of the country, Sharaa told CNN that ""people who fear Islamic governance either have seen incorrect implementations of it or do not understand it properly."" Religious minorities, particularly Christians and Alawite Muslims, have expressed alarm at HTS becoming the de facto government of their country. Religious persecution experts have warned that the jihadists have a history of persecuting non-Sunni Muslims and Christians are not safe under HTS. ""HTS, with its al-Qaeda/ISIS roots, has historically been very violent towards Christian minorities, which should mean increased persecution,"" Jeff King, the president of International Christian Concern (ICC), told Breitbart News this month. ""The fall of Aleppo to these groups [Christians] will signify the beginning of the end for one of the last significant Christian strongholds in the region if unchecked."" Critics noted Khan's apparent lack of interest in minority persecutions in contrast to his energetic attempts to prosecute the government of Israel for defending itself following the terrorist atrocities by the jihadists of Hamas on October 7, 2023. Khan requested arrest warrants for Israeli Prime Minister Benjamin Netanyahu and his defense minister at the time, Yoav Gallant, claiming they were engaging in crimes against humanity in the Hamas-controlled Gaza region. The ICC issued the warrants in November. Israeli Foreign Minister Gideon Saar condemned Khan for meeting with the HTS leadership following his visit. ""He [Khan] already ran to Damascus to meet with al-Julani, head of HTS (designated as a terrorist organization by the UN Security Council), and former al-Qaeda operative,"" Saar wrote in a social media message. ""So much for 'international legal institutions'. Show me who your friends are and I'll tell you who you are."" ""Karim Kahn didn't find the time to come to Israel, a democratic country governed by the rule of law and with an independent judiciary, before issuing arrest warrants against its democratically elected leaders,"" Saar observed.",Breitbart,2025-01-20T20:16:03+00:00,Politics,Polarised,https://www.breitbart.com/middle-east/2025/01/20/icc-prosecutor-leading-charge-against-israel-meets-syrias-jihadi-overlords/
1,"Exclusive — Senate Sources: FBI, Senate Democrats Reason for Tulsi Gabbard’s Slow-Walked Confirmation","Former Rep. Tulsi Gabbard, President Donald Trump's nominee to be Director of National Intelligence (DNI), is one of the only members of his national security team in his Cabinet to not even have a Senate hearing scheduled yet. Senate sources familiar with the matter told Breitbart News on Monday that the FBI's background check of Gabbard is taking close to the maximum amount of time the FBI has had to conduct it and is expected back to the Senate Intelligence Committee sometime in the next 24 to 48 hours. As soon as the FBI background check documents are filed with the committee, Sen. Tom Cotton (R-AR)—who chairs the Intelligence Committee—is expected to formally file a notice to schedule Gabbard's confirmation hearing. That's where Senate Democrats come in. If they do not agree to expedite Gabbard's hearing—Sen. Mark Warner (D-VA) is the main player here, as he is the top Democrat on the Senate Intelligence Committee—then the committee under its rules needs to notice the hearing a full seven days later from the notice when it's filed. That means Gabbard's confirmation hearing could drag into next week, and her full confirmation before the Senate could drag into February depending on how ridiculous Democrats get with this whole process. A lot of people close to Gabbard and Trump want to see the Senate Republicans really aggressively throw down hard against the Senate Democrats on this matter, too, multiple sources familiar with the process told Breitbart News. At this point, though, Senate Republicans have not aggressively publicly called out Democrats too much here. That could change in the days after President Trump's and Vice President JD Vance's inauguration on Monday, as sources close to Senate leaders say Republicans are quickly growing frustrated with Democrat obstruction and deep state efforts to slow-walk Trump's cabinet picks. The Senate Intelligence Committee requires nominees it considers to submit multiple elements of paperwork. The first element is Office of Government Ethics filings which Gabbard has already submitted. The second is the FBI background check, and the window for that to be finalized is quickly closing. An agreement between the Trump Transition and the federal government limits the FBI background check timing to 14 days, and the FBI began its process with Gabbard around Jan. 8, per Senate sources. That means the documents should be in front of the committee by later this week, Wednesday at the latest, and as soon as that happens Cotton is expected to notice the hearing for exactly 7 days later. But again, if Warner and the Democrats dropped their nonsense, Cotton could get the hearing scheduled faster—perhaps as early as Wednesday or Thursday of this week. That's what happened with Trump's pick to lead the CIA, John Ratcliffe, who actually had his hearing before the FBI documents were submitted and might even be confirmed by the full U.S. Senate by later on Monday depending on if Democrats consent to move forward. Ratcliffe has the votes for confirmation, and will be confirmed; the only question is whether the Senate Democrats agree to streamline the confirmation vote or resist speedy confirmation. Interestingly, four years ago when former President Joe Biden's pick for DNI was up for confirmation, the Senate agreed to confirm her–Avril Haines–literally on Inauguration Day. Haines got 84 votes, including from several of the Republicans who have yet to wholeheartedly endorse Gabbard. Haines has been a complete radical leftist, so there is zero justification for any Republican who voted for Haines not voting Gabbard. These include Sens. Susan Collins (R-ME) and Todd Young (R-IN). Anything less than full support for Gabbard after they voted for Haines would be seen as a complete betrayal of the nation's security and a demented denialism of the election results from November 5. What's more, the longer these establishment Republicans hold off on publicly backing Gabbard and the longer Democrats keep blowing through every procedural trick in the book to drag this out longer, they are harming national security and these senators would end up being personally responsible for emboldening bad actors on the world stage. In addition, there is a radical leftist who supports ""Diversity, Equity, and Inclusion"" who is currently serving as the acting DNI until such time as Gabbard is confirmed: Of course, any Senate Republican who does not publicly back Gabbard at this time and work aggressively to speed up the confirmation process and pressure Democrats to get on board with moving quickly would be personally responsible for this insanity being installed at the office of the DNI in the Trump administration right now.",Breitbart,2025-01-20T20:08:56+00:00,Politics,Polarised,https://www.breitbart.com/politics/2025/01/20/exclusive-senate-sources-fbi-senate-democrats-reason-tulsi-gabbards-slow-walked-confirmation/
2,Trump: Official U.S. Government Policy Is 'There Are Only Two Genders',"President Donald Trump in his inauguration speech on Monday unequivocally rejected the gender insanity pushed by the Biden administration over the past four years. The 45th and 47th president reclaimed biological reality in the U.S. Capitol building in Washington, DC, and promised to restore U.S. government policy to reflect the truth that there are only two sexes. ""As of today, it will henceforth be the official policy of the United States government that there are only two genders: male and female,"" he said, garnering a standing ovation from those in attendance. The outgoing Biden administration was unabashed in its efforts to push gender ideology and transgenderism both in the United States and abroad , hoisting transgender flags at agencies , embassies , and the White House . Specifically, the far-left administration attempted, mostly through unelected bureaucrats and rulemaking , to codify ""gender identity"" into all aspects of life , ignoring the differences between the equal, yet complimentary sexes. One outcome of gender ideology has been the attempted erasure of women's sports and spaces, including locker rooms , restrooms , and even prisons . Most notably, the Biden administration backed sex change drugs and procedures for children, and fought for confused children to access mutilating drugs and surgeries, even though such methods can lead to infertility . Transgender activists frequently claim that such sex-mutilating drugs and procedures for confused minors reduce suicides and improve mental health — dubious claims which increasingly appear untrue as more studies and data come to light. Many people who have undergone these sex change drugs and procedures as minors and later decided to reverse course, called detransitioners , have begun speaking out about the irreversible physical damage and mental torment they have experienced . RELATED STORY: Detransitioner: Genital Surgery 'Destroyed My Life' — 'You Need to Be Insane' to Think This Helps Patients In stark contrast, Trump promised on the campaign trail and after the election to protect women's sports and spaces and children from experimental procedures. ""With a stroke of my pen on day one, we're going to stop the transgender lunacy,"" Trump said in a speech at Turning Point USA's AmericaFest in December. ""And, I will sign executive orders to end child sexual mutilation — get transgender out of the military and out of our elementary schools, and middle schools, and high school."" ""And, we will keep men out of women's sports,"" Trump added. ""And, that will likewise be done on day one. Should I do day one, day two, or day three? How about day one? Under the Trump administration, it will be the official policy of the United States government, that there are only two genders, male and female.""",Breitbart,2025-01-20T19:48:21+00:00,Politics,Polarised,https://www.breitbart.com/politics/2025/01/20/trump-in-inauguration-speech-official-policy-of-u-s-government-is-there-are-only-two-genders-male-and-female/
3,Javier Milei Condemns Brazil for Keeping Bolsonaro from Trump Inauguration,"President of Argentina Javier Milei blamed the ""regime"" of Brazilian radical leftist President Luiz Inácio Lula da Silva for not allowing conservative former President Jair Bolsonaro to travel to the United States for the inauguration of President Donald Trump. ""He [Bolsonaro] is my friend. I'm very sorry that the Lula regime won't let him come,"" Milei told Brazilian reporters on Saturday during his attendance at the Hispanic Inaugural Ball. Bolsonaro, a staunch supporter of President Trump and the MAGA movement, has been banned from leaving his country since February 2024 after Supreme Federal Tribunal (STF) Justice Alexandre de Moraes ordered local police to seize his passport as part of a broad probe into an alleged ""coup"" plot following his narrow defeat in the 2022 presidential election. Shortly after the election of President Trump and Vice President JD Vance in November, Bolsonaro stated that, should he receive an invitation, he would seek permission from Brazil's STF to have his passport temporarily returned so that he could attend. STF Justice de Moraes – a self-styled ""anti-fake news crusader"" and rapporteur of several open cases against Bolsonaro who has ordered police raids against Bolsonaro and his family — denied Bolsonaro's request and subsequent appeal last week. De Moraes claimed the email invitation he received from the Trump Vance Inaugural committee was not sufficient proof he was invited. De Moraes also justified refusing the request by claiming that Bolsonaro represented a ""flight risk"" and that he showed ""indications"" that he may try to flee Brazil and claim political asylum. Bolsonaro's wife, former Brazilian First Lady Michelle Bolsonaro traveled to the United States to represent him, instead. His son Eduardo, a Brazilian lawmaker, was also invited. Jair Bolsonaro posted a photo of his wife and son in America on Monday and congratulated Trump on his inauguration. While unable to travel, Bolsonaro accompanied his wife to Brasilia's international airport on Saturday and told reporters that he was ""upset"" over de Moraes' refusal to allow him to travel to the United States. ""It would obviously be great for me to go there. So much so that President Trump invited me. I'm upset. I'm still shaken,"" Bolsonaro said. The former Brazilian President also criticized de Moraes and denounced being the target of a ""blatant persecution"" by the STF justice who, Bolsonaro asserted, is doing ""whatever he wants"" with the intention of ""eliminating"" Brazil's right wing. ""Unfortunately, I was unable to attend this event in the United States and without having been convicted a single time,"" Bolsonaro said. ""They will not defeat us with narratives. A person in the Supreme Court cannot be the owner of the truth, the owner of the world."" January 20 marks the first presidential inauguration in U.S. history that features international heads of state among its guests. President Luiz Inácio Lula da Silva, who reportedly did not receive an invitation, will be represented by Ambassador Maria Luiza Viotti. According to Brazilian outlets, a delegation of 21 lawmakers, including Bolsonaro's son Eduardo Bolsonaro, attended the U.S. Presidential inauguration ceremony. In remarks given during a Monday morning meeting with his ministers, Lula said that he hopes the United States continues to be a ""historic partner"" of Brazil throughout Trump's second administration. ""There are those who say that Trump's election may cause problems for world democracy. As president of Brazil I hope that he will have a profitable administration (...) and that the Americans will continue to be Brazil's historical partner,"" Lula said. On Sunday, Eduardo Bolsonaro published a video on his Instagram account of former First Lady Michelle Bolsonaro holding a video call with Jair Bolsonaro while she participated in a dinner on the eve of January 20. View this post on Instagram A post shared by Eduardo Bolsonaro (@bolsonarosp)",Breitbart,2025-01-20T19:46:30+00:00,Politics,Polarised,https://www.breitbart.com/latin-america/2025/01/20/javier-milei-condemns-brazil-for-keeping-bolsonaro-from-trump-inauguration/
4,President Donald Trump to Declare National Emergency at Southern Border,"President Donald Trump will declare a national emergency at the southern border, he announced Monday during his inaugural address in the U.S. Capitol rotunda. Trump wielded the critical issue of border security with incredible effectiveness during his campaign against Vice President Kamala Harris, and his policy-heavy inaugural speech made clear he will move swiftly to act on his campaign promises. ""Today, I will sign a series of historic executive orders,"" Trump said. ""With these actions, we will begin the complete restoration of America and the revolution of common sense. It's all about common sense."" He continued, ""First, I will declare a national emergency on our southern border. All illegal entry will immediately be halted and we will begin the process of returning millions and millions of criminal aliens back to the places from which they came. We will reinstate my Remain in Mexico policy, I will end the practice of catch and release, and I will send troops to the southern border to repel the disastrous invasion of our country."" Trump will go even further. ""Under the orders I sign today, we will also be designating the cartels as foreign terrorist organizations,"" he said. ""And by invoking the Alien Enemies Act of 1798, I will direct our government to use the full and immense power of federal and state law enforcement to eliminate the presence of all foreign gangs and criminal networks, bringing devastating crime to U.S. soil, including our cities and inner cities."" ""As Commander in Chief, I have no higher responsibility than to defend our country from threats and invasions, and that is exactly what I am going to do,"" Trump said. ""We will do it at a level that nobody has ever seen before.""",Breitbart,2025-01-20T19:12:44+00:00,Politics,Polarised,https://www.breitbart.com/politics/2025/01/20/president-donald-trump-declares-national-emergency-southern-border/


In [14]:
len(bb_data_df)

100

In [29]:
def scrape_kos_article(url):
    """
    Scrapes an article from a given URL on https://www.dailykos.com and extracts relevant information.
    
    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Polarised",
        "url": url
    }

    try:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')

            # Title
            title_meta = soup.find('meta', property='og:title')
            article_data["title"] = title_meta['content'] if title_meta else "Title not found"
            
            # URL
            url_meta = soup.find('meta', property='og:url')
            article_data["url"] = url_meta['content'] if url_meta else url
            
            # Site name
            site_name_meta = soup.find('meta', property='og:site_name')
            article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
            
            # Published date
            published_date_meta = soup.find('meta', property='article:published_time')
            if not published_date_meta:
                # Fallback to noscript timestamp
                timestamp_span = soup.select_one(".story__timestamp span.timestamp")
                if timestamp_span and 'data-epoch-time' in timestamp_span.attrs:
                    # Convert timestamp to human-readable date
                    epoch_time = int(timestamp_span['data-epoch-time']) / 1000  # Convert milliseconds to seconds
                    human_readable_date = datetime.utcfromtimestamp(epoch_time).strftime('%Y-%m-%d %H:%M:%S')
                    article_data["date"] = human_readable_date
                else:
                    article_data["date"] = "Published date not found"
            else:
                article_data["date"] = published_date_meta['content']
                
            # Category
            category_meta = soup.find('meta', property='article:section')
            article_data["category"] = category_meta['content'] if category_meta else "Category not found"

            # Article text
            story_content_divs = [
                div for div in soup.find_all('div', class_='story__text')
                if 'placeholder' not in div.get('class', [])
            ]
            
            if story_content_divs:
                paragraphs = []
                exclusion_phrases = [
                    "Donate now to support",
                    "Join us on Bluesky", "Bluesky Starter Pack", "staff accounts on Bluesky", "Daily Kos is on Bluesky",
                    "Your reader support means everything", "please donate just $3", 
                    "value having free and reliable access", "Daily Kos is supported by readers like you.", "Can you chip in today?"
                ]
                
                for div in story_content_divs:
                    for p in div.find_all('p', recursive=False):  # Direct <p> children only
                        text = p.get_text(strip=True)
                        if not any(phrase in text for phrase in exclusion_phrases) and not text.startswith("Donate now to support"):
                            paragraphs.append(text)
                
                raw_article = ' '.join(paragraphs)
                article_data["text"] = clean_text(raw_article )

        
        else:
            print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")

    except Exception as e:
        print(f"An error occurred: {e}")

    return article_data

# List of URLs to scrape
urls = get_urls_from_txt("kos.txt")
# Scrape articles and create a DataFrame
kos_data_df = scrape_multiple_articles(urls, scrape_kos_article)
# Store to CSV
kos_data_df.to_csv("polarised_scraped_articles_kos.csv", index=False)
# Print head 
kos_data_df.head()

Unnamed: 0,title,text,site,date,category,class,url
0,Nebraska went big for Trump—and that may kill its economy,"Nearly 60% of Nebraskavotedfor Donald Trump last November. There is perhaps no state more dependent on immigrants than Nebraska. Oops. Thisexcellent NPR storyhighlights the challenges this Trump-loving state now faces as a result of its voters' choices. ""Nebraska is one of the top meat producers in the U.S. It also has one of the worst labor shortages in the country,"" reporter Jasmine Garsd writes. ""For every 100 jobs, there are only 39 workers, according to the U.S. Chamber of Commerce."" She mentions the executive director of the state's pork producer's association as smiling ""wearily"" as colleagues urge attracting more immigrants to the state to help fill positions ... and yet they vote for the guy who wants to deport them all. On the other hand, there remains a staunch belief that Trump won't actually carry out his mass deportation threats. ""There's no way it can,"" the pork guy says about the deportations. And for now, maybe he's right. Trump seems more interested in usingperformative raidsin Chicago and other sanctuary cities to demonize local Democratic politicians and officials who refuse to do his bidding (which they aregenerally permitted to do). Trump may rip a few dozen undocumented immigrants out of their new community, but he's more interested in a raid's propaganda value than he is in its results. If he really wants to deport masses of undocumented immigrants, there's an obvious place to start: red states. Many Republican governors have offered to help. Take Nebraska. ""I am encouraged by the strength of President Trump's immigration and border security orders,"" Nebraska Gov. Jim Pillensaid in a statementthis past Tuesday. ""The state of Nebraska will support these efforts. On my return to Lincoln this week, I will issue an executive order to all state agencies directing them to cooperate to the full extent of the law with federal efforts to enforce our immigration laws and affirmatively support the apprehension of criminal aliens."" The NPR story quotes a lovely parishioner at an Episcopalian church who is working to serve and protect the state's immigrant community: ""I think there's still enough in our Nebraska DNA that we do depend on each other. We come from storms, weather incidents, where you depend on your neighbors and you go dig somebody out of a snowstorm. Even if you don't really like them, you go dig them out because it's what you do. Because we're Nebraska."" That parishioner says people in the state ""understand the economic necessity of [immigrant labor], and we are not stupid."" However, given Nebraska's overwhelming vote for Trump, that assertion seems debatable.",Daily Kos,2025-01-26 16:00:08,Category not found,Polarised,https://www.dailykos.com/stories/2025/1/26/2298712/-Nebraska-went-big-for-Trump-and-that-may-kill-its-economy
1,The origins of Trump’s war on diversity,"Explaining the Rightis a weekly series that looks at what the right wing is currently obsessing over, how it influences politics—and why you need to know. Donald Trump spent his first week in office using the power of the presidency to attack decades of actions to promote diversity. Trumpissued executive ordersrolling back desegregation orders, instructing government offices that promote diversity to be shuttered, and laying off workers tasked with promoting inclusion. He even threatened federal employees, ordering them tosnitch on each otherto expose any pro-diversity positions that might have flown under the radar. Trump, of course, has along history of racism, and the conservative movement and Republican Party he leads has frequently voiced opposition to social progress. But the executive actions in his second administration are occurring with a level of vigor much higher than in his first term. So what happened? In 2020, the murder of George Floyd by Minneapolis police set offa series of protestsagainst racial injustice throughout the United States and the world. While such protests have always occurred, the uprising of 2020 happened at the height of the coronavirus pandemic—and while Trump was in the White House, pursuing a racist agenda that includedpraising Nazisand other right-wing extremists. Conservatives, still reeling from significant advances from Barack Obama's presidency, chose to strike back by more forcefully opposing pro-diversity efforts in government, academia, and business. Activists like Christopher Ruforallied conservativesto attack diversity, equity, and inclusion (DEI) programs and falsely claimed that the academic ""critical race theory,"" which discusses systemic racism, was being taught at all school levels rather than just in college. At the same time, conservative media—led by outlets like Fox News—attacked any and all advances toward racial and gender equity as ""woke,"" subverting the meaning of the term originally coined by Black activists to promote racial awareness. Suddenly, according to Fox,everything under the sunwas ""woke."" When M&M's added a female character to its advertising, Fox's Harris Faulkner complained that the brand was ""back to pushing woke politics, the company now introducing a new progressive pack of chocolates: women M&M's."" Conservative pundit Buck Sexton appeared on Fox to argue that lingerie company Victoria's Secret's choice to get rid of its ""angels"" concept was a sinister plot to ""go a little bit woke."" Similarly, leading Republicans like Florida Gov. Ron DeSantisdeclaredthat his state was at war with ""woke."" DeSantis ordered the state to only use red, white, and blue lights on bridges—prohibiting lights that celebrate events like Pride Month or Juneteenth. He even wenton the attackagainst one of the state's biggest and most beloved companies, Disney, because diversified some of its content. Trump was involved in all of this by promoting attacks on diversity from within the presidency, includingfalsely labelingthe racial justice protests as being purely violent. Trump is also a habitual consumer of Fox News and the network has a demonstrated ability to influence his world view, as can be seen by the slew of Fox talking headsnow populatinghis administration. While Trump was adopting this posture, many of the former members of his administration and their ideological allies were concoctingProject 2025, which isextremely focusedon rolling back many civil rights gains and other advances—like same-sex marriage—that have occurred in the ensuing decades. Now in his second term, Trump has installed Project 2025 architects in his Cabinet and is already implementing their wish list. Trump began his political careerriding a waveof racist right-wing anger after the United States elected its first Black president. And now, he and his party have the civil rights movement squarely in their crosshairs.",Daily Kos,2025-01-26 00:00:07,Category not found,Polarised,https://www.dailykos.com/stories/2025/1/25/2298925/-The-origins-of-Trump-s-war-on-diversity
2,How Eric Adams’ downfall led him straight into Trump’s arms,"In a recent interview with former Fox News host Tucker Carlson, New York City Mayor Eric Adams alleged that the Democratic Party ""left"" him. ""People often say, 'Well, you know, you don't sound like a Democrat, and you know, you seemed to have left the party.' No, the party left me, and it left working-class people,"" said Adams, who is running for reelection as a Democrat against the backdrop of five federalcriminal charges. The roughly50-minuteinterview with Carlson, one of Donald Trump'smost ardent allies, is Adams' latest attempt to appeal to the president, whoexpressed a willingnessto pardon him for his crimes. It also represents a break from Adams' past attitudes toward Carlson.During his 2021 mayoral run, Adams said he didn't ""want or need the support of Tucker Carlson, or anyone else who perpetuates racist, anti-immigrant propaganda."" But a criminal case clearly changed his outlook. More recently, Adams' attempts to curry favor with the president have been downright pathetic, butwithout the helpof his constituents, Adams might need to bring in the big guns. During his conversation with Carlson, Adams criticized both the Biden administration and Democrats' immigration policies, which he onceclaimedthreatened to ""destroy New York City."" He also said that the indictment against him was politically motivated. ""You complained about allowing hundreds of thousands of illegals and this indictment was punishment for complaining,"" Carlson said. ""That is clearly my belief,"" Adams confirmed. Later, Carlson questioned why Republican-led states shouldn't bus asylum seekers to New York City. For years, red states like Texas havespent millions of dollarsbusing immigrants to blue cities like New York in what amounts to a cruel political stunt. ""Isn't it fair to send all their illegals here because you guys welcome them?"" Carlson asked, nodding to the fact that New York is asanctuary city, meaning it aims to shield immigrants from unwarranted enforcement actions, including arrest or deportation. ""No, we're not welcoming them. Let me be very clear,"" he said, adding thatformer President Joe Biden and his aides once told him to tamp down his criticism regarding the influx of immigrants to help the party during the 2024 presidential election. ""Basically, be a good Democrat, Eric,"" Adams said. ""That was the basic overall theme."" ""It appeared to me there was a bigger focus on the national election and not on what it was doing to the cities,"" Adams later said, nodding to the influx of immigrants in New York City that officials estimatecosts $5 billion. The interview aired on the Tucker Carlson Network Tuesday evening but, asPolitico reports, was never promoted on the mayor's public schedule. A spokesperson for Adamsalso told the outletthat it was Carlson who requested the interview. ""Mayor Adams does not believe we should be living in silos and speaking into echo chambers,"" the spokesperson said in a statement. ""At a time where our country is so divided, the mayor believes we must break out of our comfort zones and speak with everyone—even those we may not always agree with."" While Adams, at times, gently refused some of Carlson's claims, it was a relatively friendly chit-chat. But the timing of it is interesting, as it comes amid increased speculation that Adams is angling for a presidential pardon ahead of hisApril trial. Adams, who wasfirst electedin 2021, pleaded not guilty to the federal charges of bribery, conspiracy, wire fraud, and two counts of illegally soliciting a campaign contribution from a foreign national. Since at least October, Adams has tried to play nice with Trump and other Republicans. That month, after Trump defended Adams and suggested that both of them had been unfairly ""persecuted,"" Adamsavoided criticizingTrump. Now Adams is done playing coy. Shortly after Trump's reelection in November, Adamsrequested a meetingwith Trump's incoming border czar, Tom Homan, then later danced around whether he'dconsider switching parties. Adams was a registered Republican from 1995 to 2002 before becoming a Democrat. Indeed, if party switching or, at the very least, cozying up to Republicans who are close to the president, is the way to win favor with Trump and the GOP, Adams' most recent interview and actions would suggest he's all in. On Monday, Adams scrapped his public scheduleto make the trekto Washington, D.C., after the Trump administration extended a last-minute invite to his inauguration. Adams, whopractically beggedTrump for a ticket, made the pilgrimage in thewee hours of the night. He had met with the president in Palm Beach, Florida,merely days prior. ""Inauguration Day is a sacred American tradition. Our country has been through so much, and every president has the honor and responsibility to protect and lead the American people,"" Adamswrote on X. ""I believe there's much we can achieve working alongside the federal government as we support our city's values and fight for New Yorkers."" Even if the interview with Carlson was just a formality, Trump is reportedly lapping up Adams' interest in his administration. According toRolling Stone, the president's allies have rightly called Adams ""thirsty"" for Trump's attention and even questioned when the ""actual begging and love letters start."" Adams isunlikely to be reelectedlater this year, so maybe his newfound fealty to the GOP is part of his backup plan once he's out of office. Or maybe he's angling for a role in Trump's Cabinet. After all, it is arevolving doorof senior-level administrators.",Daily Kos,2025-01-25 22:00:08,Category not found,Polarised,https://www.dailykos.com/stories/2025/1/25/2298270/-How-Eric-Adams-downfall-led-him-straight-into-Trump-s-arms
3,Maddow blasts Trump for his dangerous public health gag order,"The first few days of Donald Trump's second term have given Rachel Maddow a lot to cover, including the president's approach to public health. ""[Trump] has ordered that all information be stopped, including scientific information, to advise hospitals on how to deal with this emerging epidemic,"" Maddow said. The Trump administration's decision tocease all public-facing communicationfrom federal health agencies has been widely criticized. ""Is that popular? Is that a good idea? Is that perceived as a popular idea among the American people? Is that what you thought you were voting for?"" Maddow asked. ""You know what else isn't a popular idea? Bird flu."" Maddow went on to point out that the price of eggs hasskyrocketedas a result of the bird flu, inflating an already fraught economic situation. At the same time, Trump's choice to run the Department of Health and Human Services, Robert F. Kennedy Jr., has reportedlysolicited a raw milk purveyorto take up a role in the Food and Drug Administration. Scientists warn againstthe consumption of raw milkfor many reasons, least of which is that it canspread diseases like bird flu. ""In the meantime, stop releasing any information on this bird flu thing—anything,"" Maddow concluded. ""Honestly, it sounds scary. Maybe if we don't talk about it, maybe it'll go away. And by the way, pay no attention to the price of eggs."" Thebird fluhas affected livestock across the United States, from poultry to cattle. In January, thefirst human deathconnected with the disease was recorded in Louisiana.",Daily Kos,2025-01-25 18:00:02,Category not found,Polarised,https://www.dailykos.com/stories/2025/1/25/2298931/-Maddow-blasts-Trump-for-his-dangerous-public-health-gag-order
4,Hegseth confirmed as Trump's defense secretary in tie-breaking vote,"The Senate confirmedPete Hegsethas the nation's defense secretary Friday in a dramatic late-night vote, swatting back questions about his qualifications to lead the Pentagon amidallegations of heavy drinking and aggressive behaviortoward women. Rarely has a Cabinet nominee faced such wide-ranging concerns about his experience and behavior as Hegseth, particularly for such a high-profile role atop the U.S. military. But the Republican-led Senate was determined to confirm Hegseth, a former Fox News host and combat veteran who has vowed to bring a ""warrior culture"" to the Pentagon, rounding out PresidentDonald Trump'stop national security Cabinet officials. Vice President JD Vance was on hand to cast a tie-breaking vote, unusual in the Senate for Cabinet nominees, who typically win wider support. Hegseth himself was at the Capitol with his family. Senate Majority LeaderJohn Thunesaid Hegseth, as a veteran of the Army National Guard who served tours in Iraq and Afghanistan, ""will bring a warrior's perspective"" to the top military job. ""Gone will be the days of woke distractions,"" Thune said, referring to the diversity, equity and inclusion initiatives being slashed across the federal government. ""The Pentagon's focus will be on war fighting."" The Senate's ability to confirm Hegseth despite a grave series of allegations against him will provide a measure of Trump's political power and ability toget what he wants from the GOP-led Congress, and of the potency of the culture wars to fuel his agenda at the White House. Next week senators will be facing Trump's otheroutside Cabinet choicesincluding particularly Kash Patel, a Trump ally who has published an enemies list, as the FBI director; Tulsi Gabbard as director of the office of national intelligence; and Robert F. Kennedy, Jr, the anti-vaccine advocate at Health and Human Services. ""Is Pete Hegseth truly the best we have to offer?"" said Sen. Jack Reed of Rhode Island, the top Democrat on the Senate Armed Services Committee, urging his colleagues to think seriously about their vote. Hegseth himself was working the phones late Friday to shore up his support, his confirmation at stake. ""He's a good man,"" Trump said of Hegseth while departing the White House to visit disaster-hit North Carolina and Los Angeles. ""I hope he makes it."" Trump leveled criticism of Sen. Lisa Murkowski of Alaska and Sen. Susan Collins of Maine, who announced they would vote against Hegseth. And Tump raised fresh questions about Sen. Mitch McConnell, R-Ky., saying, ""And of course Mitch is always a no vote, I guess. Is Mitch a no vote?"" In the end all three voted against Hegseth, as tensions soared late Friday at the Capitol. McConnell, the former GOP leader in the Senate, had not declared his vote, but signaled skepticism in an earlier speech when he declared he would confirm nominees to senior national security roles ""whose record and experience will make them immediate assets, not liabilities."" He voted against. It takes a simple majority to confirm Hegseth, and Republicans, with a 53-47 majority in the Senate, could only lose one more objection. One Republican, Sen. Thom Tillis of North Carolina, sent the Senate swirling as he raised questions and was provided information and answers, said a person familiar with the situation Thursday and granted anonymity to discuss it. But Tillis ultimately voted to confirm Hegseth who he said ""has a unique perspective"" and is passionate about modernizing the military. He said he spoke to Hegseth for ""nearly two hour"" about his concerns. Democrats, as the minority party, have helped confirm Secretary of StateMarco Rubioand CIA DirectorJohn Ratcliffein bipartisan votes to Trump's national security team within days of his return to the White House. But Democrats gravely opposed to Hegseth have little power to stop him, and instead have resorted to dragging out the process. Hours before the vote, Democrat after Democrat took to the Senate floor to object. Sen. Chris Murphy, D-Conn., said during the debate there are few Trump nominees as ""dangerously and woefully unqualified as Hegseth."" Hegseth facesallegations that he sexually assaulted a womanat a Republican conference in California, though he has denied the claims and said the encounter was consensual. He laterpaid $50,000 to the woman. More recently, Hegseth's former sister-in-lawsaid in an affidavitthat he was abusive to his second wife to the point that she feared for her safety. Hegseth has denied the allegation, and in divorce proceedings, neither Hegseth nor the woman claimed to be a victim of domestic abuse. During afiery confirmationhearing, Hegseth dismissed allegations of wrongdoing one by one, and vowed to bring ""warrior culture"" to the top Pentagon post. Hegseth has promised not to drink on the job if confirmed. But Republican senators facing an intensive pressure campaign by Trump allies to support Hegseth have stood by his nomination, echoing his claims of a ""smear"" campaign against him. A Princeton and Harvard graduate, Hegseth represents a newer generation of veterans who came of age in the aftermath of the Sept. 11, 2001, attacks. He went on to a career at Fox News as the host of a weekend show, and was unknown to many on Capitol Hill until Trump tapped him for the top Defense job. Hegseth's comments that women should have no role in military combat drew particular concern on Capitol Hill, including from lawmakers who themselves served. He has since tempered those views as he met with senators during the confirmation process. Murkowski said in a lengthy statement ahead of a test vote on Hegseth that his behaviors ""starkly contrast"" with what is expected of the military. ""I remain concerned about the message that confirming Mr. Hegseth sends to women currently serving and those aspiring to join,"" Murkowski wrote on social media. Collins said that after a lengthy discussion with Hegseth, ""I am not convinced that his position on women serving in combat roles has changed."" But one prominent Republican, Sen. Joni Ernst of Iowa, herself a veteran and sexual assault survivor, came under harsh criticism for her skepticism toward Hegseth and eventually announced she would back him. Hegseth would lead an organization with nearly 2.1 million service members, about 780,000 civilians and a budget of $850 billion. In exercising its advise and consent role over Trump's nominees, the Senate is also trying to stave off his suggestion that the GOP leaders simply do away with the confirmation process altogether, and allow him to appoint his Cabinet choices when the Congress is on recess. Trump raised the idea of so-called ""recess appointments"" during a private White House meeting with Thune and House Speaker Mike Johnson, a step many senators are trying to avoid.",Daily Kos,2025-01-25 03:09:35,Category not found,Polarised,https://www.dailykos.com/stories/2025/1/24/2299021/-Hegseth-confirmed-as-Trump-s-defense-secretary-in-tie-breaking-vote


In [27]:
len(kos_data_df)

NameError: name 'kos_data_df' is not defined

In [28]:
# Combine DataFrames
polarised_dataset = pd.concat(
    [tcw_data_df, can_data_df, bb_data_df, kos_data_df],
    ignore_index=True
)

# Basic checks
print(polarised_dataset.info())   # Data types & non-null counts
print(polarised_dataset.head())   # Quick glance at first rows

# Print out the categories
print(polarised_dataset["category"].value_counts())

# Confirm 4 sites are represented
print("Number of unique sites:", polarised_dataset["site"].nunique())

#Check if any empty articles
empty = polarised_dataset[polarised_dataset["text"]=="Article text not found"]
print (empty)

NameError: name 'tcw_data_df' is not defined

In [24]:
# Store to CSV
polarised_dataset.to_csv("polarised_articles.csv", index=False)

### 3. Satire
Satirical content is intended to entertain or provoke thought through humor, exaggeration, or irony. Satire is often misunderstood as factual. 

##### Features:

- Humourous or Exaggerated Tone: Content is typically marked by wit, parody, or absurdity.
- Intentional Ridiculousness: The story is meant to be funny, not factual; outlandish claims serve comedic purposes.

##### Label If:

- The piece’s goal is clearly comedic or parodic, rather than deceptive.
- The tone, language, or disclaimers indicate it’s intentionally satirical.

##### Do Not Label If:

- The piece uses humour but is still intended to mislead (label as Fabricated Content).
- The piece is comedic but still pushing a heavily skewed narrative as if it’s true (label as Polarised Content).

##### Sources:
- The Onion (USA - 55 articles)
- Babylon Bee (USA - 50 articles)
- The Daily Squib (UK - 45 articles)
- Waterford Whispers (IE - 50 articles)


**The Onion**

The articles scraped are the ones featured on the 2024 "Annual Year" post found here: https://theonion.com/our-annual-year-2024/ - the top 5 from each month have been chosen (image posts have been excluded as per scope), so a total of 55 articles as December is excluded. The remaining 45 articles will be from the standard ratings hierachy found here: https://theonion.com/latest/

In [5]:
def scrape_onion_article(url):
    """
    Scrapes an article from a given URL on theonion.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Satire", 
        "url": url
    }

    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Title
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Published date
        published_date_meta = soup.find('meta', property='article:published_time')
        article_data["date"] = published_date_meta['content'] if published_date_meta else "Published date not found"
        
        # Category
        category_element = soup.find('div', class_='taxonomy-category')
        category_link = category_element.find('a') if category_element else None
        article_data["category"] = category_link.text.strip() if category_link else "Category not found"
        
        # Article copy
        content_div = soup.find(
            "div",
            {"class": lambda x: x and "entry-content" in x and "single-post-content" in x}
        )
        if content_div:
            paragraphs = content_div.find_all("p")
            full_text = " ".join(p.get_text(strip=True) for p in paragraphs)
            article_data["text"] = clean_text(full_text)
        else:
            article_data["text"] = "Article text not found"
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data


# List of URLs to scrape
urls = get_urls_from_txt("onion.txt")
# Scrape articles and create a DataFrame
onion_data_df = scrape_multiple_articles(urls, scrape_onion_article)
# Store to CSV
onion_data_df.to_csv("satire_scraped_articles_onion.csv", index=False)
# Print head 
onion_data_df.head()

Unnamed: 0,title,text,site,date,category,class,url
0,Biden Addresses Nation While Hanging From Branch On Side Of Cliff,"WASHINGTON—Using his platform to plead for Americans to lend him a hand, President Joe Biden addressed the nation Monday while hanging from a branch on the side of a cliff. ""Our democracy has never before hung in the balance more than it has at this moment when I am in danger of plummeting 50 feet to those sharp rocks below,"" said Biden, who implored the U.S. populace to set aside its differences and find a long stick, a rope, or, preferably, a helicopter that they could use to return him to stable ground. ""What's important is not what led us to this point, but rather how we choose to move forward in helping me back up. Even a carefully placed mattress or pile of sofa cushions would do. My fellow Americans, I urge you to act fast, as a small bird has landed on my head and is now pecking at me."" At press time, a Gallup Poll had found that 70% of Americans opposed Biden being rescued.",The Onion,2024-01-01T11:45:00+00:00,Politics,Satire,https://theonion.com/biden-addresses-nation-while-hanging-from-branch-on-sid-1851106795/
1,Marriage Counselor Sides With Hotter Spouse,"ANCHORAGE, AK—Stating that she had heard both perspectives and could understand their frustrations, marriage counselor Laurie Hartford reportedly told couple David and Julia Carter that she ultimately had to side with the hotter spouse. ""So, I've listened to everything you've had to say, and I've come to the conclusion that while David does seem to be emotionally withholding, he's also at least two points hotter,"" said the therapist, who rushed to note that, in all fairness, she needed to take into consideration that she would at best describe the female half of the relationship ""as, like, a six even on her best day."" ""I've spent hours listening to you pour out your hearts and that's never easy, so pat yourselves on the back. But, frankly, only one of you has bothered to comb your hair or put on a nice shirt at these sessions. I'm not in any way trying to invalidate your experiences. All I'm saying is that only one of you—David—has an ass that you could bounce a quarter off, and the other one is kind of an uggo, if that makes sense?"" Hartford went on to say that it might be helpful if Julia stayed at home for their next sessions so that they could spend more time understanding where, exactly, David's hotness came from.",The Onion,2024-01-09T11:30:00+00:00,Local,Satire,https://theonion.com/marriage-counselor-sides-with-hotter-spouse-1851143488/
2,Wealthy Dad Surprises Child With Tree House He Can Airbnb For Passive Income,"WILMETTE, IL—Telling the child not to peek as they walked into the backyard, local wealthy man Kenneth Schweitz reportedly surprised his son Tuesday with a tree house that the young boy could Airbnb for passive income. ""It's time you got your own little space that can be rented out for short-term stays and used to produce a reliable revenue stream,"" a visibly excited Schweitz said as he took his hands off his son's eyes to reveal the fully appointed structure built into the tree's branches, stressing to the boy that he would not have to do any real work for the lodging to generate substantial returns. ""Your mom and I can help you decorate it, but then it's all up to you to decide how much to charge per night and which cleaning service to hire, bud. After that, you can sit back and collect thousands of dollars a month. How cool is that? You and your little friends are going to have so much fun building your little real estate empire. Enjoy!"" At press time, sources reported Schweitz's son was enthusiastically climbing into the tree house to serve an eviction notice to the low-income family currently living there.",The Onion,2024-01-09T17:30:00+00:00,Local,Satire,https://theonion.com/wealthy-dad-surprises-child-with-tree-house-he-can-airb-1851112919/
3,"Glowing, Pulsating Hair Product Takes Control Of Gavin Newsom’s Thoughts","SACRAMENTO, CA—As an otherworldly glow emanated from the California governor's meticulously sculpted coiffure, sources confirmed Friday that the pulsating hair product on Gavin Newsom's head had taken control of his thoughts. ""There will be no bills signed, no presidential campaign—there will only be hair,"" said the disembodied voice emanating from the greasy, slicked-back mass atop Newsom's skull, his hair reportedly growing into thick, powerful tendrils long enough to choke out his political opponents anywhere they might try to hide in the State Capitol. ""There will be no clemency for those who refuse to succumb to the wet and shiny hair. With these mighty strands, I command the wildfires and the earthquakes, the droughts and the floods!"" At press time, sources confirmed Newsom's hair product had evicted several homeless people seeking shelter within the throbbing gelatinous nest upon his head.",The Onion,2024-01-19T17:45:00+00:00,Politics,Satire,https://theonion.com/glowing-pulsating-hair-product-takes-control-of-gavin-1851160421/
4,Gen Z Announces Julie Andrews Is Problematic But Refuses To Explain Why,"​​NEW YORK—Standing before a crowd of millennials, Gen Xers, and baby boomers, members of Generation Z announced at a press conference Wednesday that actress Julie Andrews was problematic, but they refused to explain why. ""You know what she did—you just don't want to admit it,"" said Gen Z spokesperson Taylor Collaco, who rolled her eyes in response to requests from those who wanted to know what exactly theSound Of Musicstar had said or done to have earned the ostracism of millions of Americans ages 12 to 27. ""Yes, that Julie Andrews. Has she been so normalized that you can't even see it? Yikes. Oh, come on, it's not my job to educate you."" At press time, Gen Z had dropped a hint that it had something to do with the Genovian monarchy.",The Onion,2024-01-24T13:44:00+00:00,Entertainment,Satire,https://theonion.com/gen-z-announces-julie-andrews-is-problematic-but-refuse-1851180352/


In [6]:
len(onion_data_df)

100

In [7]:
onion_data_df

Unnamed: 0,title,text,site,date,category,class,url
0,Biden Addresses Nation While Hanging From Branch On Side Of Cliff,"WASHINGTON—Using his platform to plead for Americans to lend him a hand, President Joe Biden addressed the nation Monday while hanging from a branch on the side of a cliff. ""Our democracy has never before hung in the balance more than it has at this moment when I am in danger of plummeting 50 feet to those sharp rocks below,"" said Biden, who implored the U.S. populace to set aside its differences and find a long stick, a rope, or, preferably, a helicopter that they could use to return him to stable ground. ""What's important is not what led us to this point, but rather how we choose to move forward in helping me back up. Even a carefully placed mattress or pile of sofa cushions would do. My fellow Americans, I urge you to act fast, as a small bird has landed on my head and is now pecking at me."" At press time, a Gallup Poll had found that 70% of Americans opposed Biden being rescued.",The Onion,2024-01-01T11:45:00+00:00,Politics,Satire,https://theonion.com/biden-addresses-nation-while-hanging-from-branch-on-sid-1851106795/
1,Marriage Counselor Sides With Hotter Spouse,"ANCHORAGE, AK—Stating that she had heard both perspectives and could understand their frustrations, marriage counselor Laurie Hartford reportedly told couple David and Julia Carter that she ultimately had to side with the hotter spouse. ""So, I've listened to everything you've had to say, and I've come to the conclusion that while David does seem to be emotionally withholding, he's also at least two points hotter,"" said the therapist, who rushed to note that, in all fairness, she needed to take into consideration that she would at best describe the female half of the relationship ""as, like, a six even on her best day."" ""I've spent hours listening to you pour out your hearts and that's never easy, so pat yourselves on the back. But, frankly, only one of you has bothered to comb your hair or put on a nice shirt at these sessions. I'm not in any way trying to invalidate your experiences. All I'm saying is that only one of you—David—has an ass that you could bounce a quarter off, and the other one is kind of an uggo, if that makes sense?"" Hartford went on to say that it might be helpful if Julia stayed at home for their next sessions so that they could spend more time understanding where, exactly, David's hotness came from.",The Onion,2024-01-09T11:30:00+00:00,Local,Satire,https://theonion.com/marriage-counselor-sides-with-hotter-spouse-1851143488/
2,Wealthy Dad Surprises Child With Tree House He Can Airbnb For Passive Income,"WILMETTE, IL—Telling the child not to peek as they walked into the backyard, local wealthy man Kenneth Schweitz reportedly surprised his son Tuesday with a tree house that the young boy could Airbnb for passive income. ""It's time you got your own little space that can be rented out for short-term stays and used to produce a reliable revenue stream,"" a visibly excited Schweitz said as he took his hands off his son's eyes to reveal the fully appointed structure built into the tree's branches, stressing to the boy that he would not have to do any real work for the lodging to generate substantial returns. ""Your mom and I can help you decorate it, but then it's all up to you to decide how much to charge per night and which cleaning service to hire, bud. After that, you can sit back and collect thousands of dollars a month. How cool is that? You and your little friends are going to have so much fun building your little real estate empire. Enjoy!"" At press time, sources reported Schweitz's son was enthusiastically climbing into the tree house to serve an eviction notice to the low-income family currently living there.",The Onion,2024-01-09T17:30:00+00:00,Local,Satire,https://theonion.com/wealthy-dad-surprises-child-with-tree-house-he-can-airb-1851112919/
3,"Glowing, Pulsating Hair Product Takes Control Of Gavin Newsom’s Thoughts","SACRAMENTO, CA—As an otherworldly glow emanated from the California governor's meticulously sculpted coiffure, sources confirmed Friday that the pulsating hair product on Gavin Newsom's head had taken control of his thoughts. ""There will be no bills signed, no presidential campaign—there will only be hair,"" said the disembodied voice emanating from the greasy, slicked-back mass atop Newsom's skull, his hair reportedly growing into thick, powerful tendrils long enough to choke out his political opponents anywhere they might try to hide in the State Capitol. ""There will be no clemency for those who refuse to succumb to the wet and shiny hair. With these mighty strands, I command the wildfires and the earthquakes, the droughts and the floods!"" At press time, sources confirmed Newsom's hair product had evicted several homeless people seeking shelter within the throbbing gelatinous nest upon his head.",The Onion,2024-01-19T17:45:00+00:00,Politics,Satire,https://theonion.com/glowing-pulsating-hair-product-takes-control-of-gavin-1851160421/
4,Gen Z Announces Julie Andrews Is Problematic But Refuses To Explain Why,"​​NEW YORK—Standing before a crowd of millennials, Gen Xers, and baby boomers, members of Generation Z announced at a press conference Wednesday that actress Julie Andrews was problematic, but they refused to explain why. ""You know what she did—you just don't want to admit it,"" said Gen Z spokesperson Taylor Collaco, who rolled her eyes in response to requests from those who wanted to know what exactly theSound Of Musicstar had said or done to have earned the ostracism of millions of Americans ages 12 to 27. ""Yes, that Julie Andrews. Has she been so normalized that you can't even see it? Yikes. Oh, come on, it's not my job to educate you."" At press time, Gen Z had dropped a hint that it had something to do with the Genovian monarchy.",The Onion,2024-01-24T13:44:00+00:00,Entertainment,Satire,https://theonion.com/gen-z-announces-julie-andrews-is-problematic-but-refuse-1851180352/
5,MrBeast Announces He Has Resurrected Everyone Buried At Arlington National Cemetery,"GREENVILLE, NC—Telling viewers of his latest charitable video to prepare themselves for his ""most epic challenge yet,"" 25-year-old influencer Jimmy ""MrBeast"" Donaldson announced Friday that he had resurrected everyone buried at Arlington National Cemetery. ""You might not know this, but sadly, over 400,000 of our nation's most decorated veterans have to spend eternity dead and underground,"" the content creator says in the video, explaining that he has found a scientist willing to help him reanimate the dead and has spent millions of dollars to pump 75,000 volts of electricity into each plot of the hallowed cemetery. ""Thanks to this amazing procedure, every single one of these deserving corpses has crawled out of the ground and begun roaming the earth again, totally free of charge to them and their families. They'll also receive $10,000 in cash. Frankly, it's a tragedy we've let these American heroes rot in their graves for this long."" Reacting to a moment in the video when MrBeast gives a reanimated World War II veteran a Lamborghini, viewers expressed outrage toward the apparent ingratitude of the undead soldier, who only screams, ""Why did you do this? Kill me...kill me!"" into the camera.",The Onion,2024-02-02T13:24:00+00:00,News,Satire,https://theonion.com/mrbeast-announces-he-has-resurrected-everyone-buried-at-1851217565/
6,Introverted Cowboy Struggling To Round Up Posse,"BANDERA, TX—Admitting that he was actually a lot more shy and reserved than folks might think, introverted cowboy Cassidy Walsh sheepishly told reporters Friday that he'd been struggling lately to round up a posse. ""While I might seem confident and outgoing at times, the truth is, I'm the sort of feller who needs to recharge at the end of a long day ridin' the range with a bunch of cowhands,"" said Walsh, adding that he also experienced ""a might fair bit of social anxiety"" that probably stemmed from a fear his attempts to organize a posse would end in rejection. ""Don't get me wrong, I enjoy spending time with ol' buckaroos like myself. It's just that your pal Cassidy can only handle so much hootin' and hollerin' before he plumb runs out of steam. Now if you'll excuse me, I'm gonna kick up my spurs and snuggle up in my bedroll with a Louis L'Amour novel."" At press time, another successful train robbery had reportedly been carried out in the area by a tireless gang of extroverted outlaws.",The Onion,2024-02-06T15:16:00+00:00,Local,Satire,https://theonion.com/introverted-cowboy-struggling-to-round-up-posse-1851226175/
7,Country Stations Refuse To Play Beyoncé’s Music After Artist Condemns Iraq War,"HOUSTON—Calling the popular musician traitorous for failing to support President George W. Bush in a time of crisis, thousands of country stations across America reportedly refused to play Beyoncé's music Thursday after the artist condemned the Iraq War. ""If she doesn't want to support our troops risking their lives out there for the cause of freedom, then we don't need her,"" said country radio executive Hunter Roeloffs, one of many station owners who blacklisted the recent singles ""Texas Hold 'Em"" and ""16 Carriages"" after controversial comments in which the star expressed reservations about the U.S.-led Coalition invasion of Iraq—remarks that also led to a reported drop in ticket sales and Beyoncé losing a sponsorship deal with Lipton. ""Unlike Miss Knowles, we're proud Americans here at 100.3 the Bull. We support freedom, whether it's here or in the Middle East. So when she says innocent lives will be lost, I can't help but wonder how she could possibly think a bloodthirsty dictator like Saddam Hussein is innocent. And then there's that line of hers about being ashamed of President Bush? Well, we're ashamed of her. How about that?"" At press time, Beyoncé had attracted additional criticism from the country music scene after rebranding herself as the Chicks.",The Onion,2024-02-15T19:00:00+00:00,Entertainment,Satire,https://theonion.com/country-stations-refuse-to-play-beyonce-s-music-after-a-1851261135/
8,"‘Stab Him! Stab Him, You Cowards!’ Says Terrified Kamala Harris To Aides After Plunging First Knife Into Biden’s Back","WASHINGTON—Moments after pulling shut the door to the Roosevelt Room and locking it behind her, a terrified Vice President Kamala Harris reportedly told aides to ""Stab him! Stab him, you cowards!"" on Friday after she plunged a knife into President Joe Biden's back. ""What are you waiting for, you fools? Strike now! Strike before the opportunity goes cold!"" said the blood-dappled vice president, who, as her staff appeared to grow uncertain of the blades in their shaking hands and backed away toward the exit, reminded each panicked aide in turn that they had pledged their fealty for this day. ""Think of all I've promised you. Think of all we stand to gain. Quick, now, the first blow has been rendered. There is no going back. We're confederates in this. We must act now or be damned by inaction!"" At press time, sources confirmed President Biden had complained to an assistant of a tightness in his shoulder and returned to the Oval Office with the knife still protruding from his back.",The Onion,2024-02-16T11:15:00+00:00,Politics,Satire,https://theonion.com/stab-him-stab-him-you-cowards-says-terrified-kamal-1851243467/
9,Emerging Filmmaker Malia Obama Changes Surname To Scorsese,"PARK CITY, UT—Noting that she did not want her parents' fame to distract from her Sundance premiere, industry sources confirmed Thursday that emerging filmmaker Malia Obama had changed her surname to 'Scorsese.' ""Although her legal name is still Obama, Malia is officially promoting her short filmThe Heartunder the pseudonym Malia Martin Scorsese,"" said Sundance spokesperson Shelby Fleming, adding that the 25-year-old had been using the more neutral, nondescript moniker since writing for Donald Glover's television seriesSwarm. ""When people see the last name Scorsese, they don't see the daughter of a former president. They see a blank slate. She's hopeful this slight change will help people take her art much more seriously."" At press time, Obama announced that her next film would be a gritty portrait of 1970s Little Italy titledMean Streets.",The Onion,2024-02-22T18:40:00+00:00,Entertainment,Satire,https://theonion.com/emerging-filmmaker-malia-obama-changes-surname-to-scors-1851278946/


**Babylon Bee**

Articles from the Greatest Hits page (https://babylonbee.com/news?sort=greatest-hits) have been scraped. The categories "Christian Living" and "Scripture" were excluded for being too niche. The articles range from 2017 to 2022. The final 15 came from the trending news section (https://babylonbee.com/news?sort=buzzing), all from January to February 2025.


In [18]:
def scrape_bee_article(url):
    """
    Scrapes an article from a given URL on babylonbee.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Satire",
        "url": url
    }

    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        #print (soup)
        # Title
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Published date       
        published_date_meta = soup.find('meta', {"name": "published_at"})
        if published_date_meta and published_date_meta.get("content"):
            article_data["date"] = published_date_meta["content"].split()[0]
        else: "Published date not found"
        
        # Category
        category_link = soup.find("a", href=lambda href: href and "/news/categories/" in href)
        if category_link:
            article_data["category"] = category_link.get_text(strip=True)
        else:
            article_data["category"] = "Category not found"
            
        # Article copy
        content_div = soup.select_one("div.article-content")
        if content_div:
            paragraphs = content_div.find_all("p")
            full_text = " ".join(p.get_text(strip=True) for p in paragraphs)
            article_data["text"] = clean_text(full_text)
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

# List of URLs to scrape
urls = get_urls_from_txt("bee.txt")
# Scrape articles and create a DataFrame
bee_data_df = scrape_multiple_articles(urls, scrape_bee_article)
# Store to CSV
bee_data_df.to_csv("satire_scraped_articles_bee.csv", index=False)
# Print df 
bee_data_df.head()

Unnamed: 0,title,text,site,date,category,class,url
0,Trump: 'I Have Done More For Christianity Than Jesus',"WASHINGTON, D.C. - In response to theChristianity Todayeditorial calling for his removal, Trump called the magazine a ""left-wing rag"" and said, ""I have done more for Christianity than Jesus."" ""I mean, the name of the magazine isChristianity Today, and who is doing more for Christians today? Not Jesus. He disappeared; no one knows what happened to him. But I'm out there every day protecting churches from crazy liberals."" While Trump admitted that Jesus did do some things for Christianity in the past, Trump said he was doing more now and it was more substantial. ""I'm appointing judges to help protect religious rights,"" Trump stated. ""How many judges has Jesus appointed? He says something about judging people in the future, but I ain't seen it."" Furthermore, Trump asserted that he ""saved Christmas."" ""Look what I've done,"" he said. ""You can say 'Merry Christmas' now. In fact, if you say 'Happy Holidays' and don't immediately make it clear you're referring to Christmas, you go to prison. What has Jesus ever done for Christmas? Be born? He wants credit for that? Come on.""",The Babylon Bee,2019-12-23,Politics,Satire,https://babylonbee.com/news/trump-i-have-done-more-for-christianity-than-jesus
1,Senate To Be Replaced With Room Full Of Monkeys Throwing Feces,"WASHINGTON, D.C. - In an emergency, overnight referendum, the American people voted on Thursday to replace the United States Senate with a room full of monkeys throwing feces. The measure passed with 57% of the vote. 22% of voters thought the Senate should be replaced by barking seals, while 17% voted that the replacement should be the pit of venomous snakes from Indiana Jones. 3.97% voted that Senate members be replaced by screaming goats. ""About 100 people"" voted for the current Senators to keep their jobs, with this tiny voting bloc centered in Washington, D.C. Highland Ape Rescue out of West Virginia will be teaming up with Cornwell Primate farms to supply hundreds of monkeys and apes to the Senate. The animals will be fed a nutritious mixture of foods that produce easily throwable feces. Protective glass will be put up around the Senate for camera crews to safely film, but anyone being interviewed by the new senators will have to sit in the middle of the poo-flinging octagon, coming under a heavy barrage of projectile excrement. ""It will be a huge improvement from how things were before,"" said ape trainer, Marlena Henwick. ""No more 10-12 hour hearings. With these monkeys, all the fecal projectiles will have been flung in under 30 minutes. One and done."" The recently replaced senators will be placed on display at the National Zoo in Washington, D.C. for families to observe and zoologists to study.",The Babylon Bee,2018-09-28,Politics,Satire,https://babylonbee.com/news/senate-to-be-replaced-with-room-full-of-monkeys-throwing-feces
2,Motorcyclist Who Identifies As Bicyclist Sets Cycling World Record,"NEW YORK, NY - In an inspiring story from the world of professional cycling, a motorcyclist who identifies as a bicyclist has crushed all the regular bicyclists, setting an unbelievable world record. In a local qualifying race for the World Road Cycling League, the motorcyclist crushed the previous 100-mile record of 3 hours, 13 minutes with his amazing new score of well under an hour. Professional motorcycle racer Judd E. Banner, the brave trans-vehicle rider, was allowed to race after he told league organizers he's always felt like a bicyclist in a motorcyclist's body. ""Look, my ride has handlebars, two wheels, and a seat,"" he told reporters as he accepted a trophy for his incredible time trial. ""Just because I've got a little extra hardware, such as an 1170-cc flat-twin engine with 110 horsepower, doesn't mean I have any kind of inherent advantage here."" Banner also said he painted the word ""HUFFY"" on the side of his bike, ensuring he has no advantage over the bikes that came out of the factory as bicycles. Some critics say he needs to cut off his motor in order to make the competition fairer, but he quickly called these people bigots, and they were immediately banned from professional cycle racing.",The Babylon Bee,2019-10-25,Sports,Satire,https://babylonbee.com/news/motorcycle-that-identifies-as-bicycle-sets-world-cycling-record
3,Trump Announces He Has Hidden 5 Golden Tickets Among Stimulus Checks,"WASHINGTON, D.C. - Trump has built up a lot of buzz over the coming stimulus payments, saying he has hidden five golden tickets among the checks heading to Americans this week. Anyone who gets a golden ticket will win a free tour of Mar-a-Lago. Rumor has it that Trump will be watching them closely to see which of the winners has the qualities he looks for in a manager, with the best candidate getting hired as Mar-a-Lago's onsite McDonald's manager. ""Who will win? Nobody knows!"" Trump said gleefully as he carefully signed each of the golden tickets before hiding them among the stimulus checks. ""I, Donald Trump, have decided to allow five Americans - just five, mind you, and no more - to visit my resort this year. These lucky five will be shown around personally by me, and they will be allowed to see all the secrets and the magic of my hotel and golf resort -- the best golf, maybe ever. Then, at the end of the tour, as a special present, all of them will be given Season 1 ofThe Apprenticeon DVD!"" ""So watch out for the Golden Tickets! Five Golden Tickets have been printed on golden paper, and these five Golden Tickets have been hidden in your stimulus checks. These five may be anywhere - in any mailbox in the country. And the five lucky finders of these five Golden Tickets are the only ones who will be allowed to visit my Mar-a-Lago during the lockdown. Good luck to you all!"" Unfortunately, he put all five golden tickets in a stimulus envelope addressed to Jim Acosta.",The Babylon Bee,2020-04-15,Politics,Satire,https://babylonbee.com/news/trumps-says-5-golden-tickets-to-be-hidden-among-stimulus-checks
4,NBA Players Wear Special Lace Collars To Honor Ruth Bader Ginsburg,"LOS ANGELES, CA - NBA players are honoring the life of Ruth Bader Ginsburg this week by wearing pretty lace collars just like Notorious RBG used to wear. In a touching show of respect for the late Justice Ginsburg, and in solidarity with her progressive cause, Lebron James and the LA Lakers took to the court yesterday wearing a stunning variety of delicate white collars inspired by RBG's wardrobe. According to several commentators on ESPN, the virtual teleconference crowd fell silent in reverent awe as the players all knelt down and chanted ""RBG! RBG! RBG!"" ""Yeah, RBG was an amazing person,"" said LeBron James after the game. ""I have her biography right here and I totally read it right before the game. She was a judge. That's cool, I respect that. Judges judge things and not everyone can do that. She believed in Black Lives Matter and being on the right side of history and stuff."" Power forward Anthony Davis also expressed his happiness with the collars. ""It's good to honor her today with these lacey things. Commissioner Adam Silver and President Xi Jinping told us to wear them so we did. I just took this little doily thing from under a table lamp at my mom's house and cut a hole in the middle. Easy."" NBA players are vowing to wear the collars until Trump is removed from office, or until angry rioters burn their basketball arenas down, whichever comes first.",The Babylon Bee,2020-09-22,Politics,Satire,https://babylonbee.com/news/nfl-to-adorn-all-uniforms-with-lace-doilies-in-to-honor-rbg


In [19]:
bee_data_df

Unnamed: 0,title,text,site,date,category,class,url
0,Trump: 'I Have Done More For Christianity Than Jesus',"WASHINGTON, D.C. - In response to theChristianity Todayeditorial calling for his removal, Trump called the magazine a ""left-wing rag"" and said, ""I have done more for Christianity than Jesus."" ""I mean, the name of the magazine isChristianity Today, and who is doing more for Christians today? Not Jesus. He disappeared; no one knows what happened to him. But I'm out there every day protecting churches from crazy liberals."" While Trump admitted that Jesus did do some things for Christianity in the past, Trump said he was doing more now and it was more substantial. ""I'm appointing judges to help protect religious rights,"" Trump stated. ""How many judges has Jesus appointed? He says something about judging people in the future, but I ain't seen it."" Furthermore, Trump asserted that he ""saved Christmas."" ""Look what I've done,"" he said. ""You can say 'Merry Christmas' now. In fact, if you say 'Happy Holidays' and don't immediately make it clear you're referring to Christmas, you go to prison. What has Jesus ever done for Christmas? Be born? He wants credit for that? Come on.""",The Babylon Bee,2019-12-23,Politics,Satire,https://babylonbee.com/news/trump-i-have-done-more-for-christianity-than-jesus
1,Senate To Be Replaced With Room Full Of Monkeys Throwing Feces,"WASHINGTON, D.C. - In an emergency, overnight referendum, the American people voted on Thursday to replace the United States Senate with a room full of monkeys throwing feces. The measure passed with 57% of the vote. 22% of voters thought the Senate should be replaced by barking seals, while 17% voted that the replacement should be the pit of venomous snakes from Indiana Jones. 3.97% voted that Senate members be replaced by screaming goats. ""About 100 people"" voted for the current Senators to keep their jobs, with this tiny voting bloc centered in Washington, D.C. Highland Ape Rescue out of West Virginia will be teaming up with Cornwell Primate farms to supply hundreds of monkeys and apes to the Senate. The animals will be fed a nutritious mixture of foods that produce easily throwable feces. Protective glass will be put up around the Senate for camera crews to safely film, but anyone being interviewed by the new senators will have to sit in the middle of the poo-flinging octagon, coming under a heavy barrage of projectile excrement. ""It will be a huge improvement from how things were before,"" said ape trainer, Marlena Henwick. ""No more 10-12 hour hearings. With these monkeys, all the fecal projectiles will have been flung in under 30 minutes. One and done."" The recently replaced senators will be placed on display at the National Zoo in Washington, D.C. for families to observe and zoologists to study.",The Babylon Bee,2018-09-28,Politics,Satire,https://babylonbee.com/news/senate-to-be-replaced-with-room-full-of-monkeys-throwing-feces
2,Motorcyclist Who Identifies As Bicyclist Sets Cycling World Record,"NEW YORK, NY - In an inspiring story from the world of professional cycling, a motorcyclist who identifies as a bicyclist has crushed all the regular bicyclists, setting an unbelievable world record. In a local qualifying race for the World Road Cycling League, the motorcyclist crushed the previous 100-mile record of 3 hours, 13 minutes with his amazing new score of well under an hour. Professional motorcycle racer Judd E. Banner, the brave trans-vehicle rider, was allowed to race after he told league organizers he's always felt like a bicyclist in a motorcyclist's body. ""Look, my ride has handlebars, two wheels, and a seat,"" he told reporters as he accepted a trophy for his incredible time trial. ""Just because I've got a little extra hardware, such as an 1170-cc flat-twin engine with 110 horsepower, doesn't mean I have any kind of inherent advantage here."" Banner also said he painted the word ""HUFFY"" on the side of his bike, ensuring he has no advantage over the bikes that came out of the factory as bicycles. Some critics say he needs to cut off his motor in order to make the competition fairer, but he quickly called these people bigots, and they were immediately banned from professional cycle racing.",The Babylon Bee,2019-10-25,Sports,Satire,https://babylonbee.com/news/motorcycle-that-identifies-as-bicycle-sets-world-cycling-record
3,Trump Announces He Has Hidden 5 Golden Tickets Among Stimulus Checks,"WASHINGTON, D.C. - Trump has built up a lot of buzz over the coming stimulus payments, saying he has hidden five golden tickets among the checks heading to Americans this week. Anyone who gets a golden ticket will win a free tour of Mar-a-Lago. Rumor has it that Trump will be watching them closely to see which of the winners has the qualities he looks for in a manager, with the best candidate getting hired as Mar-a-Lago's onsite McDonald's manager. ""Who will win? Nobody knows!"" Trump said gleefully as he carefully signed each of the golden tickets before hiding them among the stimulus checks. ""I, Donald Trump, have decided to allow five Americans - just five, mind you, and no more - to visit my resort this year. These lucky five will be shown around personally by me, and they will be allowed to see all the secrets and the magic of my hotel and golf resort -- the best golf, maybe ever. Then, at the end of the tour, as a special present, all of them will be given Season 1 ofThe Apprenticeon DVD!"" ""So watch out for the Golden Tickets! Five Golden Tickets have been printed on golden paper, and these five Golden Tickets have been hidden in your stimulus checks. These five may be anywhere - in any mailbox in the country. And the five lucky finders of these five Golden Tickets are the only ones who will be allowed to visit my Mar-a-Lago during the lockdown. Good luck to you all!"" Unfortunately, he put all five golden tickets in a stimulus envelope addressed to Jim Acosta.",The Babylon Bee,2020-04-15,Politics,Satire,https://babylonbee.com/news/trumps-says-5-golden-tickets-to-be-hidden-among-stimulus-checks
4,NBA Players Wear Special Lace Collars To Honor Ruth Bader Ginsburg,"LOS ANGELES, CA - NBA players are honoring the life of Ruth Bader Ginsburg this week by wearing pretty lace collars just like Notorious RBG used to wear. In a touching show of respect for the late Justice Ginsburg, and in solidarity with her progressive cause, Lebron James and the LA Lakers took to the court yesterday wearing a stunning variety of delicate white collars inspired by RBG's wardrobe. According to several commentators on ESPN, the virtual teleconference crowd fell silent in reverent awe as the players all knelt down and chanted ""RBG! RBG! RBG!"" ""Yeah, RBG was an amazing person,"" said LeBron James after the game. ""I have her biography right here and I totally read it right before the game. She was a judge. That's cool, I respect that. Judges judge things and not everyone can do that. She believed in Black Lives Matter and being on the right side of history and stuff."" Power forward Anthony Davis also expressed his happiness with the collars. ""It's good to honor her today with these lacey things. Commissioner Adam Silver and President Xi Jinping told us to wear them so we did. I just took this little doily thing from under a table lamp at my mom's house and cut a hole in the middle. Easy."" NBA players are vowing to wear the collars until Trump is removed from office, or until angry rioters burn their basketball arenas down, whichever comes first.",The Babylon Bee,2020-09-22,Politics,Satire,https://babylonbee.com/news/nfl-to-adorn-all-uniforms-with-lace-doilies-in-to-honor-rbg
5,"In Bold Anti-Trump Statement, Pelosi Rips Up Bible","WASHINGTON, D.C. - In a bold, powerful statement to oppose Trump, Speaker of the House Nancy Pelosi solemnly tore up the Bible after Trump was seen holding one up in front of a church. At a press conference, the Speaker of the House held up a Bible and then ripped it in two, declaring that she was against anything Trump was associated with. ""If Trump is for the Bible, then I am against it,"" she said as she struggled to rip the Bible in half. Finally, aides came to intervene, pre-ripping the spine of the Bible so it would be easier for her to tear. ""All the books of the Bible are bad: Genesis, Joseph, the one with the big fish, even Hezekiah. We must stand against Trump's bigotry by ripping up anything he claims to be for."" ""Yass, queen! Slay!"" shouted her fans at the press conference as she finally managed to rip the Bible up. ""You're my president!"" In a genius move, Trump then held up a Koran in front of a mosque, forcing Pelosi to tear up a Koran and alienate the left.",The Babylon Bee,2020-06-03,Politics,Satire,https://babylonbee.com/news/pelosi-rips-up-bible
6,Biden Cuts Hole In Mask So He Can Still Sniff People's Hair,"WASHINGTON, D.C. - Joe Biden has committed to wearing a mask in public to be a good example and to prevent the spread of COVID-19. Aides were disappointed and a little frightened, however, when Biden immediately cut a large hole in the middle of the mask so he could continue to invade people's personal space and sniff their hair, necks, and faces. Staffers usually don't let Biden play with sharp objects, but he managed to find some safety scissors stashed behind the Metamucil in his campaign bus. Using the purple plastic scissors, he cut a large hole and then fitted the mask to his face, confident that he was protecting himself and others from the virus. ""That's better,"" he said as he cut a big hole for his schnoz. ""Now I'm protecting against infection and I'm still able to give the ladies a good sniff. You know, in my day, I wore a mask just like this, as was the fashion at the time. All the kids at the pool would ask to play with the mask, and they'd run their fingers through it. In fact, one time, a gangster named CornPop was about to go cause some trouble at the sock hop, and I put some rocks in my mask and started swinging it around like a sling. You know, real Daniel and Goliath type stuff. He looked at me, tears in his eyes, and promised never again to go out and cause a ruckus."" ""Anyway, that's why I'm your best choice for senator of the Roman Empire. Vote for Joe!"" Biden suddenly came to and realized he was standing in a Walmart parking lot talking to a hobo.",The Babylon Bee,2020-04-09,Politics,Satire,https://babylonbee.com/news/biden-cuts-holes-in-medical-mask-so-he-can-still-sniff-people
7,Man Identifying As 6-Year-Old Crushes Game-Winning Homer In Tee-Ball Championship,"AUBURN, CA - Local 36-year-old man Nate Ripley, who identifies as a six-year-old, ""absolutely crushed"" a game-winning homer at a local tee-ball game and won the championship for his team Monday evening, reports confirmed. Ripley reportedly walked up to the plate in the bottom of the 6th, pointed his bat toward the left-field wall looming 130 feet in the distance, and let her rip, sending the ball rocketing over the fence and into a parking lot as the fans cheered and his coach yelled out, ""Attaboy, Nate! Good job, bud!"" His team, the Lil' Padres, attempted to hoist him up on their shoulders in celebration of their great victory over the favored Tiny Tigers, but were unable to pick up the large 230-pound man. Ripley's feat comes at the end of a momentous tee-ball season, in which the self-identified six-year-old absolutely shattered every record set prior to that point. With a 1.000 batting average, 52 home runs, and an incredible showing at first base, second base, shortstop, third base, and pitcher, the man is being called an inspiration to other six-year-olds everywhere. ""I'm just proud to be here with my team. It's all for the love of the game,"" an emotional Ripley told reporters while enjoying an orange slice and juice box after the championship. ""I couldn't have done it without my team.""",The Babylon Bee,2017-06-06,Lifestyle,Satire,https://babylonbee.com/news/man-identifying-6-year-old-crushes-game-winning-homer-tee-ball-championship
8,Biden: 'I Am The Only Candidate Who Can Beat Ronald Reagan',"HOUSTON, TX - Fresh off his afternoon nap, presidential candidate Joe Biden gave a fiery, high-energy speech in Houston today, claiming to be the only candidate who could beat incumbent Ronald Reagan. ""I am the only candidate who can unite the party to defeat Reagan,"" he said to scattered applause. ""When Super Thursday hits here in a few weeks, we can rally the 150 million Democrats here in the great country of Texas to vote for me so we can get Reagan and his crony Dick Cheney off the Iron Throne there in the Imperial Senate. Go Hoosiers!"" Aides scrambled to turn off Biden's mic but he beat them away with his walker. ""The time has come for the reign of Tippecanoe and Tyler too to end!"" he shouted, though by this point he had wandered into a nearby field and no one could hear him.",The Babylon Bee,2020-03-02,Politics,Satire,https://babylonbee.com/news/biden-i-am-the-only-candidate-who-can-beat-ronald-reagan
9,Fisher-Price Releases 'My First Peaceful Protest' Playset With House You Can Actually Burn Down,"EAST AURORA, NY - The toy geniuses at Fisher-Price have announced a brand new toy made just for leftist parents and their kids: the My First Peaceful Protest playset. The kid-size clubhouse will come with several varieties of spray paint so kids can tag the tiny building with their own empowering slogans. It will also be made out of cardboard, allowing the cute little tikes to burn the whole thing down if their demands are not met. ""Here at Fisher-Price, we are steadfastly committed to social justice,"" said toy designer Camden Flufferton. ""We need to teach our kids what democracy looks like, and there's no better example of democracy in action than violent vandalism and arson. We hope this new playset will serve as an inspiration for parents wanting to teach their kids how to threaten citizens with violence whenever their demands are not met."" The set will also come with toy televisions, cell phones, jewelry, and clothing, allowing kids to simulate looting before they torch the entire set. The set will be available in stores for $399 because of capitalism. Experts are questioning the wisdom of this move by Fisher-Price, mainly because people in the target market don't typically have any kids. ""We know we'll probably only sell, like, 3 of these,"" said Flufferton, ""but selling them isn't the point. We just need you to know we're on the right side of history.""",The Babylon Bee,2020-09-21,Politics,Satire,https://babylonbee.com/news/fisher-price-introduces-supreme-court-protest-playhouse-that-can-be-vandalized-and-burned-down


**The Daily Squib**

100 articles were taken from the "Most Popular" page: https://www.dailysquib.co.uk/category/most-popular

In [32]:
def scrape_squib_article(url):
    """
    Scrapes an article from a given URL on dailysquib.co.uk and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Satire",
        "url": url
    }

    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # title
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Published date        
        published_meta = soup.find("meta", property="article:published_time")
        if published_meta and published_meta.get("content"):
            article_data["date"] = published_meta["content"].split("T")[0]
        
        # Category
        category_div = soup.find("div", class_="tdb-category td-fix-index")
        if category_div:
            cat_links = category_div.find_all("a", class_="tdb-entry-category")
            if cat_links:
                categories = [
                    #ignore "most popular"
                    a.get_text(strip=True) for a in cat_links if a.get_text(strip=True).lower() != "most popular"
                ]  
                #if multiple categories, return the first
                article_data["category"] = categories[0]
            else:
                article_data["category"] = "Category not found"
        else:
            article_data["category"] = "Category not found"

        # Extract the article text
        content_div = soup.find("div", class_="td-post-content")
        
        if content_div:
            # remove blockquotes (e.g. embedded tweets)
            for bq in content_div.find_all("blockquote"):
                bq.decompose()
            paragraphs = content_div.find_all("p")
            full_text = " ".join(p.get_text(strip=False) for p in paragraphs)
            article_data["text"] = clean_text(full_text.strip())
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

# List of URLs to scrape
urls = get_urls_from_txt("squib.txt")
# Scrape articles and create a DataFrame
squib_data_df = scrape_multiple_articles(urls, scrape_squib_article)
# Store to CSV
squib_data_df.to_csv("satire_scraped_articles_squib.csv", index=False)
# Print df 
squib_data_df

Unnamed: 0,title,text,site,date,category,class,url
0,Let Putin Celebrate His Pyrrhic Victory,"For the sake of peace, and a chance for Russia to rearm before the next offensive; it is best to let Putin celebrate his pyrrhic victory. Let the Russians have their deluded fun. The meat grinder sometimes needs a little rest, some oiling before more Russian drone-fodder are sent into the muddy fields to have their bodies mutilated. The deluded Russians can wallow in their false success in ""winning"" the war in Ukraine, but first they're going to have to bury what's left of their army. Although the Kremlin will never reveal the death statistics, many experts estimate that the number is up in the millions. Yes, of course, the poor Ukrainians defending their land also died, but certainly the numbers will not be as high as the Russians, simply because they did not have enough men to throw into the meat grinder — not as much as the Russians. Lessons learned? Naturally, no one, especially the Russians, will learn from any of this, and the same thing will be repeated ad infinitum. Luckily, for everyone, Biden and Obama are not around any more to fuck things up again. The truth is, Putin wants Europe, and it seems the weak socialists of the EU will eventually bend over the table when the time comes and take one up the rectum for Putin with not even a tiny whimper or utterance of discomfort, such is their submissive cowardice. Once a surrender monkey, always a surrender monkey. Maybe the EU could send their Eurovision Song Contest applicants to the front lines to serenade the next wave of amphetamine drugged Russian conscripts as they push forward with blind gusto. No more help from Uncle Sam.",Daily Squib,2025-02-24,World,Satire,https://www.dailysquib.co.uk/world/60538-let-putin-celebrate-his-pyrrhic-victory.html
1,Bonus Good News For American Men - Unhinged Batsh*t Crazy Liberal Women Going Celibate,"Sometimes good news comes in floods of joy and happiness, and this is the case for American men as the batshit crazy, deranged blue-haired pierced and heavily tattooed liberal narcissistic women who voted for Kamala Harris have vowed to go celibate and abstain from men. It's double plus good news for the gene pool as these selfish, entitled, mentally ill, attention seeking harridans won't thankfully reproduce. ""For men, this makes things easier by taking psycho hose beasts out of the equation, only leaving the good women. The Trump win has truly been a wondrous event, as it effectively cleanses America of the shitty things, and brings back some form of goodness and purity to the country,"" a college student from Nebraska revealed. These liberal women have been so ideologically brainwashed and mentally damaged by the Kamala Harris campaign that they are going on TikTok to publicly display their total and utter shameful, demoralised mental conditions of utter indignity to everyone. The ironic thing is, these dumbos only sleep with liberal men who voted for Kamala as well, so they're taking them out of the gene pool too — triple fucking bonus. No need for your much-loved right to have abortions any more — quadruple fucking bonus!!!",Daily Squib,2024-11-08,Entertainment,Satire,https://www.dailysquib.co.uk/entertainment/58255-bonus-good-newsfor-american-men-unhinged-batsht-crazy-liberal-women-going-celibate.html
2,"Rachel Maddow Concerned Trump Will Put Her in ‘FEMA Camp’ During Second Term: ‘Yes, I’m Worried’","Clinically insane MSNBC host Rachel Maddow has expressed serious fearmongering concerns that she and millions of other clinically insane American woke socialist liberals would be interned in a ""camp"" when former President Donald Trump wins a second term in the White House this November. During an interview with Maddow in Monday's Unreliable Sources newsletter, CNN Democrat propagandist Benson Burner asked the MSNBC host about her concerns about being targeted during a second Trump administration. ""Trump and his allies are openly talking about doing the same thing the Democrats have unjustly done to him. Weaponising the government to seek revenge against critics in media and politics, with some of his extremist allies even talking about jailing the 'treacherous and treasonous scum',"" noted Burner. ""You're one of his most notable critics on television. Are you worried that you could be a target?"" Rachel Maddow replied: ""I'm worried, but I'm actually ready for being in a camp because I'm a bleeding heart liberal propagandist for the Democrat Party. I don't read the news in an objective fashion or without obvious bias in any way. When Trump invokes the Insurrection Act to deploy the U.S. military against civilians on his first day in office, I will be cheering like a cheer leader because it plays into my perpetual liberal victim state of mind. Also, when Trump imprisons me in a FEMA concentration camp, I will be able to play the part of martyr, and virtue signal to my fellow liberal socialist Americans of my suffering for the cause of socialism and communism in America. ""When Trump puts millions of blacks, criminals, Mexicans, gays, and migrants into the concentration camps, I will be happy because it would have proved my point that my scaremongering before the election actually did not work and millions of Americans voted for Donald J. Trump anyway. ""In the camps of the future it won't be so bad either, there will be plenty of women for me to become friends with. I hope Trump puts me in one of those all female camp buildings. I'll be up to my eyeballs in pussy. Really, there's nothing to fear folks, everything I say on MSNBC is absolute bullshit, and I am essentially an actor. Anyone who takes me seriously must be as mad as I am."" The 'Trump Derangement Syndrome' seems to be in full force before the coming U.S.elections.",Daily Squib,2024-06-12,World,Satire,https://www.dailysquib.co.uk/world/56316-rachel-maddow-concerned-trump-will-put-her-in-fema-camp-during-second-term-yes-im-worried.html
3,I'm a Cruise Ship Worker...These are the SIX Things Smart Passengers Always do Onboard,"A cruise ship worker has shared the six things that 'smart' passengers always do when they come aboard for their holiday. Janice Munklehouse, 21, from Grimisbury on Thames, Lincolnshire worked on cruise ships for 38 years, and regularly shares advice for passengers and crew on her Cruising Da Seas Innit YouTube channel. Now, she has shared how passengers can make the most of their voyage with her own expert tips. Read on below for the six top tips she had for cruise passengers. Don't fall overboard whilst boarding the ship Kicking off with the first piece of advice to get ahead of the game while travelling on a cruise, Janice said people should always avoid falling overboard whilst boarding the cruise ship. She said that, it is best to arrive at the destination of your cruise embarkation point ""as early as possible"", suggesting a late arrival may mean you rush to board the ship loaded with luggage and fall into the sea. While the cruise ship worker said that many people avoid drowning or being crushed by other boats in the harbour, she has seen many passengers falling off the boarding ramp in a rush to board the ship. ""One fella comically tried to jump over 17 feet because he was late for the cruise. He was summarily chewed up in the propellers, but his cruise ticket and hairbrush were thankfully recovered by a fisherman 6 months later in the stomach of a Mako shark."" Don't jump off the ship when it's cruising Moving on to her next tip, Janice encouraged passengers to not jump off the ship when it was cruising in the ocean. Janice said jumping off the cruise ship whilst moving would not be very nice for the passenger. She said that: ""Jumping into the ocean from your ultra-luxury cruise line would mean that you miss a lot of activities on board like karaoke, disco night and various themed buffets"". She said that, while it may be tempting to jump from the deck of your cruise ship to avoid all the old codgers and nouveau riche arseholes, bad things could happen to you in the open ocean hundreds of miles from shore. Don't fall off the ship The cruise worker highlighted the potential dangers of falling off the ship. If passengers wear inadequate footwear or lean over the edge of the ship, they could slip and fall into the ocean, which would not be such a good cruise ship experience. Don't get pushed off the ship Janice's fourth piece of advice for how to be a smart passenger when going on a cruise concerned not being pushed off the ship. She urged anyone looking to enjoy their holiday cruise to avoid being pushed over the side of the ship into the ocean by some nutter on board or an angry crew member. During her 38 years working on cruise ships, Janice says she has witnessed a number of crew members who have had to push annoying passengers overboard for being arseholes. She warned that many passengers end up in the ocean because the crew have simply had enough. There's only so much a crew member can take, so don't be an arse and get on everyone's tits. Don't fall out of your cabin window The cruiser worker's penultimate tip of six urged holidaymakers not to open the cabin window and hang over the edge, especially when drunk out of their minds on cheap watered-down booze. Elaborating on this point, she suggested it would always be best to simply use the cabin window to look at the ocean and not to jump into the water from it. Janice remarked: ""You may run the risk of getting eaten by sharks, or simply drowning"". Don't fall off the ramp whilst disembarking Her final piece of advice for how to be a smart passenger when disembarking a cruise ship concerned not falling off the ramp into the polluted, dangerous waters at port. Janice suggested this could be potentially very dangerous for the passenger. She said: ""The reason you want to decide on these things before you disembark is because if you fall overboard, things could get rather messy. I once saw an elderly passenger ingest an entire humungous turd in one gulp when she fell in the water near a sewage outflow pipe"". The veteran cruise worker also said that avoiding falling or jumping off a cruise ship should be a priority for most passengers, but naturally some were just born to do it. Once passengers disembarked, Janice admitted she and the crew were glad all the whinging fuckers had gone and hoped they would never be seen again — that is, until the next batch boarded the cruise ship for a trip they would surely regret for a lifetime.",Daily Squib,2024-05-12,Entertainment,Satire,https://www.dailysquib.co.uk/entertainment/55825-im-a-cruise-ship-worker-these-are-the-six-things-smart-passengers-always-do-onboard.html
4,Labour Plan to Have Speakers and Listening Devices on Every Lamp Post,"The public will be lectured on ""motivational"" socialist principles, and Labour diktats every day of their lives, a newly published manifesto paper has revealed. Speakers and microphones will be installed in every lamp post in the UK proposed by the innovative Labour plan. Much like the Mayor of London, Sadiq Khan's pet surveillance project, ULEZ, where every vehicle is tracked and charged in London, the proposed Labour ""Listen and Speak"" scheme will ensure that citizens are daily indoctrinated in soviet ideology and microphones will listen for any form of dissent against the ruling regime once it gains power in the coming election. LISTEN AND SPEAK Upon hearing of the Labour plans, one citizen voiced his distaste of such a scheme coming into fruition. ""If I want to live in fucking North Korea, I'll go and live there. Imagine walking down the street and being forced to listen to the irritating nasal droning from Comrade Starmer every fucking day of your life, listening to his awful grinding punishing nasal voice telling you how to think, what to do, where to go, I'd fucking top myself."" Along with daily lectures on the greatness of Labour socialist schemes, citizens will be indoctrinated in EU values and other communist rhetoric. If people are seen to be wearing headphones whilst walking in the streets, they will be told to take the device off from their heads, or if they are Bluetooth headphones, the Labour ""Listen and Speak"" system will hijack the Bluetooth headphones to force the citizen to listen to the latest soviet Labour messages being broadcast. Along with speakers installed on every lamp post, the Labour plan is to also install powerful microphones that will monitor each citizen's speech. The listening devices will be powered by AI and will alert the Labour Stasi authorities if any citizen speaks adversely about the Labour regime at any time or says any word that is forbidden by woke programmers who have infiltrated the English language. ""If someone says any forbidden words or speaks badly about the Labour soviet system, the AI system will identify the offender, who will then be removed from their home in the early hours of the morning. These offending individuals will then be sent to an EU sanctioned re-education centre and reprogrammed to love the Labour EU State.,"" a jubilant Labour spokesman revealed on Thursday. Comrade Starmer pronounced the scheme as a measure to ""safeguard and ensure the safety of every British citizen"" and a way to uphold ""the beloved EU rules which Labour is dedicated to rejoining"".",Daily Squib,2024-05-30,World,Satire,https://www.dailysquib.co.uk/world/56106-labour-plan-to-have-speakers-and-listening-devices-on-every-lamp-post.html
5,Personal Computers and Smartphones Were Introduced For Benefit of AI,"You may think there were some benevolent reasons for rolling out and introducing the personal computer and smartphone to the civilian population, and yes there were some, but the majority of reasons were far from that. One has to understand that the controllers work in 50 and 100 year increments and their primary modus operandi is one that may either surprise you or deep down your sub-conscious probably already knows what is going on. What is the last bastion of control over the human population? The inside of your mind, of course, this has eluded the controllers for centuries. Even the Spanish Inquisition could not get close to that level of knowledge or control, or the Nazis, or the Soviet communist dictators or the religious organisations. As a control system, religion has wavered and is not as powerful a tool for complete control any more, and this is why the controllers needed access to your most intimate thoughts, your thought cycles, as well as your very methods of thought. This process would need machines that replicate human thought to some extent, and what better way than a personal computer touted as a way to enhance human activity. Even programming languages effectively replicate human thought processes to some extent with variables, strings and multiple processing architectural archetypes that are the basic structure of 1010101, the universal on-off switch for every permutation of every possible combination of mathematical and human existence. Who's programming who, the human on the computer, or the computer on the human? Essentially speaking, the personal computer introduced to the public was a first major step into delving into the minds of the population, giving the controllers a basic map of the internal minds of humans. The next step was connectivity, and this is when the DARPA project of an internet was introduced to the general public. All of this trained internal data had to move around, it had to evolve and of course it had to be collected and filed by the controllers in their massive database banks. The internet allowed the controllers to see what people liked/disliked, it allowed them to delve into the darkest secrets of human activity as well as the thought processes and decisions people made in their lives. Every single facet of human behaviour was intricately analysed, logged and filed and in the present time it still is right now. The smartphone was then introduced as an additional form of ultimate human control. This technique was a goldmine of information for the benefit of AI systems because it formed a much more intimate picture of human activity and behaviour simply because of its small size. A mobile phone is easy to carry and is with humans pretty much all the time, whereas a bulky personal computer is generally not with a person at all times. The vast amount of data collected through this method is too vast to even comprehend for most people, but smartphones along with things like apps are a vast treasure trove of data helping the controllers map out the human brain and its collective methods. People cannot do without their smartphones now, they are totally addicted and attached to them. Studies have now shown that by taking smartphones away from some people who are then put into a room alone, results in them self-harming themselves, such is the level of control over their entire being and mind. Human data to benefit of AI systems AI will fully understand and replicate the human mind. It does not need to sleep, it does not need wage rises or maternity leave or holidays. There are no industrial disputes with AI, there are no sick days or loss of productivity. This is why AI was fed the entirety of human data because humans will be replaced by these systems soon enough as is the plan by the controllers. To fully control something, first you must completely understand every facet of it. Mapping out every single dendrite, synapse and connection of the human brain is another major project currently underway. What do you do with the entire data set of the human mind; the books, the literature, the behaviour analytics, the thought processes, the creativity, the emotions, the biases, the infinite variables and combinations of discourse etc.? You feed it all into AI machine learning projects, and this is the key factor in all the control processes envisioned by the controllers. This is all set up for the benefit of AI systems. For thousands of years they have dreamed of this very moment because they have been the few and the population has been the many, they have feared greatly of losing their grip on humanity, of losing their position of control. This is why companies like Apple recently produced an advert for their new iPad Pro with an M-chip that depicts the entire breadth and width of human culture, creativity, and art being crushed by a rubbish disposal machine. This depiction signals the final rallying call that machines have ultimately superseded the human experience and this is just the beginning of the end for the traditional biological state of humanity. In the future, when the brain chip is introduced to the entire population, it will be the final step of ultimate control. By then they will have mapped out the human brain in its entirety, and the controllers will gain direct access into every single thought and memory of each human. As is today, humans cannot function in business or anything without a smartphone, and this will be the method used for the brain chip as well. Elon Musk, a sinister deviant character, is tasked with the initial rollout of this technology, but there are others in the pipeline right now as well if he fails. Ultimately, humanity is on the cusp of a major epoch regarding the benefit of AI systems, a time of change so extreme that may bring back a state of feudalism once again but this time it will be an all encompassing form of technological feudalism and slavery incorporating complete control of the last bastion of human control — the brain. UPDATE – May 24 Looks like others are realising what the Daily Squib has been talking about for years. https://www.barrons.com/news/ai-relies-on-mass-surveillance-warns-signal-boss-20280d0a",Daily Squib,2024-05-11,Sci/Tech,Satire,https://www.dailysquib.co.uk/sci_tech/55812-personal-computers-and-smartphones-were-introduced-for-benefit-of-ai.html
6,EU Breathes Sigh of Relief as Moderate Socialists Secure German Election,"Lucky for the EU, the AfD (Alternative for Deutschland) political party was suitably suppressed and instead the moderate socialist party of Germany, Christian Democratic Union [CDU] party won the election on Sunday night. With a substantially declining economy, Germany's heyday seems to have come and gone and Friedrich Merz, the chancellor-in-waiting after winning elections on Sunday, will most probably continue to steer the country's decline as well as gloss over the increasing tensions over mass unfettered immigration into the EU zone. ""It is imperative that the EU nations are weakened further by mass unfettered immigration from Islamic countries and the Third World, and that daily terrorist attacks continue on innocent German civilians so that more fear and terror is perpetuated within EU nations and their indigenous populations are fractured further. Patriotism within EU nations is being destroyed more thoroughly every year, but substantially more damage has to be implemented. For full integration into a single soviet communist bloc, all vestiges of former national identity and racial purity must be destroyed. The DNA of white Europeans must be muddied and defiled to such a level that all those blonde, blue-eyed beautiful women will transform into something completely different. Look at Sweden, where they are being raped, stabbed, shot daily. This is the plan,"" an unnamed unelected intermediary revealed on Monday.",Daily Squib,2025-02-24,World,Satire,https://www.dailysquib.co.uk/world/60528-eu-breathes-sigh-of-relief-as-moderate-socialists-secure-german-election.html
7,Our Beloved Tech Pariah Police State UK Removes Crucial Encryption From Apple Devices - For Your Own Safety!,"Comrades, we are happy to announce that all your previously encrypted data on Apple devices will be freely viewable by any Labour government official, thoughtcrime Stasi officer, or any third party/hackers. The People's Republic of Soviet Britain has ordered the vile capitalist American tech company Apple to take away any form of encryption defence for their devices, so all your data will be viewable by anyone who wants to view it. FOR YOUR OWN SAFETY Yes, this new Labour Directive is for your own safety, as we will arrest you for thoughtcrimes that you may utter that may be harmful to yourself and consequently to the Big State. We thank the socialist Conservative government who created this directive before we assumed power. Commissar Yvette Cooper, our chief Stasi Thoughtcrime Officer, has kindly arranged for the entirety of your private data to be transferred periodically to our own databases where it will be monitored and examined with a digital microscope. The stupid capitalist democratic Americans have threatened the UK. They say our removal of encryption poses a serious a threat to American national security and that the US government should re-evaluate its intelligence-sharing agreements with the UK. The People's Republic of Soviet Britain does not take orders from capitalist swine, and we do not care about the USA. Fuck America! REMEMBER, BY TAKING AWAY ENCRYPTION, HACKERS WILL BE ABLE TO ENTER YOUR APPLE DEVICE'S DATA WITH EASE – HAVE A NICE DAY! Comrade Starmer and Commissar Reeves, who recently visited fellow communist state China, were given tips on how to implement the same Police State principles they utilise. Double plus good news, comrades! Biometric face recognition has been implemented in Cardiff, Wales, and will be rolled out to the entirety of the UK soon. The Stasi Thoughtcrime Ministry will also implement a Chinese Soviet style Citizen Rating System (social credit system) that will punish citizens who commit thoughtcrimes against the Big State by limiting their freedoms even further, and in some cases imprisoning them in gulags or detaining them in Labour Woke Re-education Facilities. INGSOC NOTICE 4532652-239187677818223-4313-923-2-32-2-1-234-1 CITIZEN MARY SMITH, 12, WAS AWARDED 2.1 GRAMS OF SUGAR RATION INCREASE FOR THE MONTH OF MARCH FOR REPORTING THOUGHTCRIMES COMMITTED BY HER 8-YEAR-OLD BROTHER, HER GRANDMOTHER AND HER FATHER. THE THOUGHTCRIMES WERE COMMITTED AWAY FROM THEIR APPLE IPHONE DEVICES AND THE BIG STATE COULD NOT MONITOR THEIR ACTIVITIES AT THE TIME. THE THOUGHTCRIMINALS WERE REMOVED FROM THEIR PLACE OF ABODE AND HAVE NOW BEEN PERMANENTLY ERASED/CANCELLED. REMEMBER COMRADES, THE BIG STATE IS ALWAYS WATCHING BUT IF YOU SEE, HEAR ANYTHING, REPORT IT IMMEDIATELY TO YOUR LOCAL STASI AGENT WITHOUT DELAY. THIS IS FOR YOUR OWN SAFETY!",Daily Squib,2025-02-22,World,Satire,https://www.dailysquib.co.uk/world/60493-our-beloved-tech-pariah-police-state-uk-removes-crucial-encryption-from-apple-devices-for-your-own-safety.html
8,That Old Obama Birth Certificate Issue Resurfaces Once Again,"Oh dear, poor old Barry, looks like he can't get much of a break these days. There are renewed calls once again to re-investigate his birth certificate, which may finally prove that he was actually born in Kenya and not in the United States. After two overt terms in office, and one covert term during Biden's presidency, Obama has already been through the process of governance, and frankly, it seems a little too late to be messing with this issue again. If fraud is found regarding the Obama birth certificate, what would be the repercussions? https://obamawhitehouse.archives.gov/sites/default/files/rss_viewer/birth-certificate-long-form.pdf Open with Photoshop or Illustrator, or Affinity Photo to see the various layers of the editable document.",Daily Squib,2025-02-22,World,Satire,https://www.dailysquib.co.uk/world/60519-that-old-obama-birth-certificate-issue-resurfaces-once-again.html
9,Zelensky Needs to Wear a Suit Says Tailoring Expert,"A tailoring expert at the famous Savile Row street in Mayfair, London has revealed that if Ukraine's president Volodymyr Oleksandrovych Zelensky wore a suit like Vladimir Putin, he might be taken a little more seriously. ""Trump is a businessman. He just sees dollars, nothing else. Everything is monetary to him, especially war. That's why Zelensky needs to start wearing a suit pronto if he wants to keep his job fighting the Russians. Maybe get a few gold taps. Look at Putin with his immaculate tailored suits, prancing around his gold-plated palaces. He can murder 15,000 civilians in a morning, and no one will bat an eyelid because of his elegant suit,"" the tailor revealed. A gentleman is defined by their tailoring, irrespective of their actions on the battlefield or in the boardroom. Putin is no angel, but neither is Zelensky. Wars are not fought by angels. All those billions that Joe Biden threw at Zelensky, perpetuating the war in Ukraine, he of course did not siphon a few dollars here or a few dollars there into his own accounts in Switzerland? Maybe Keir Starmer could donate one of his freebie ill-appropriated designer suits to the Ukrainian president? ""We are sure Zelensky can afford to buy a bespoke tailored suit made from the finest materials,"" the tailor added. It's all getting a bit dodgy now, innit.",Daily Squib,2025-02-20,World,Satire,https://www.dailysquib.co.uk/world/60456-zelensky-needs-to-wear-a-suit-says-tailoring-expert.html


**Waterford Whispers**

100 arcs were take from the homepage (https://waterfordwhispersnews.com/), sorted from most recent to least recent.

In [None]:
def scrape_whispers_article(url):
    """
    Scrapes an article from a given URL on waterfordwhispersnews.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Satire",
        "url": url
    }

    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Title
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Published date 
        date_div = soup.find("div", class_="post-date", itemprop="datePublished")
        if date_div:
            article_data["date"] = date_div.get_text(strip=True)
        else:
            article_data["date"] = "Date not found"
 
        # Category (excluding the ones used just for web display)
        excluded_categories = {"breaking news", "featured-one", "featured-two", "featured-three","homepage"}
        category_div = soup.find("div", class_="post-category")
        if category_div:
            all_cats = [a.get_text(strip=True) for a in category_div.find_all("a")]
            valid_cats = [cat for cat in all_cats
                          if cat.lower() not in excluded_categories]
            if valid_cats:
                article_data["category"] = valid_cats[0]
            else:
                article_data["category"] = "Category not found"
        else:
            article_data["category"] = "Category not found"

        # Article copy
        content_div = soup.find("div", class_="article-content", itemprop="articleBody")
        if content_div:
            for p_tag in content_div.find_all("p"):
                p_text = p_tag.get_text(strip=True).lower()
                # remove marketing snippets
                if "check out our shop." in p_text or "www.waterfordwhispers.shop" in p_text or "buy some of our merch here" in p_text or "help us to keep pissing off all the right people" in p_text:
                    p_tag.decompose()

            # remove blockquotes
            for bq in content_div.find_all("blockquote"):
                bq.decompose()

            paragraphs = content_div.find_all("p")
            full_text = " ".join(p.get_text(strip=False) for p in paragraphs)
            article_data["text"] = clean_text(full_text.strip())
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

# List of URLs to scrape
urls = get_urls_from_txt("whispers.txt")
# Scrape articles and create a DataFrame
whispers_data_df = scrape_multiple_articles(urls, scrape_whispers_article)
# Store to CSV
whispers_data_df.to_csv("satire_scraped_articles_whispers.csv", index=False)
# Print df 
whispers_data_df.head()

Failed to fetch the webpage: https://waterfordwhispersnews.com/2025/02/21/is-parenting-a-private-matter-or-jesus-jane-has-no-control-over-her-little-tearaways-at-all-at-all/. Status code: 202
Failed to fetch the webpage: https://waterfordwhispersnews.com/2025/02/20/new-champions-league-format-slammed-for-including-low-quality-minnows-like-man-city/. Status code: 202
Failed to fetch the webpage: https://waterfordwhispersnews.com/2025/02/19/cultural-appropriation-samoan-tribesman-criticised-for-getting-irish-harp-tattoo/. Status code: 202


In [None]:
# Combine DataFrames
satire_dataset = pd.concat(
    [whispers_data_df, squib_data_df, bee_data_df, onion_data_df],
    ignore_index=True
)

# Basic checks
print(satire_dataset.info())   # Data types & non-null counts
print(satire_dataset.head())   # Quick glance at first rows

# Print out the categories
print(satire_dataset["category"].value_counts())

# Confirm 4 sites are represented
print("Number of unique sites:", satire_dataset["site"].nunique())

#Check if any empty articles
empty = polarised_dataset[polarised_dataset["text"]=="Article text not found"]
print (empty)

In [None]:
def clean_category(cat: str) -> str:
    """
    Convert categories to lowercase, unify synonyms, and return a single standardised category.
    """
    # Convert to lowercase
    c = cat.strip().lower()
    
    # Standardise categories
    replacements = {
        'politics': 'politics',
        'local news': 'local',
        'world news': 'world',
        'world': 'world',
        'worldviews':'world',
        'entertainment': 'entertainment',
        'business': 'business',
        'health': 'health',
        'lifestyle': 'lifestyle',
        'life':'lifestyle',
        'sports': 'sports',
        'sport': 'sports',
        'football': 'sports',
        'gaa': 'sports',
        'sci/tech': 'technology',
        'celebs':'entertainment',
        'tech': 'tech',
        'u.s.':'united states',
        'uplifting viral content': 'entertainment', 
    }

    # Do the changes
    if c in replacements:
        c = replacements[c]
    return c

# Apply the cleaning
satire_dataset['category'] = satire_dataset['category'].apply(clean_category)

# Now check the new distribution
print(satire_dataset['category'].value_counts())

In [None]:
# Store to CSV
satire_dataset.to_csv("satire_articles.csv", index=False)

### 5. Commentary
Opinion-based content reflecting the writer’s interpretation or viewpoint, often lacking factual grounding or presenting mainly personal interpretation.

##### Features:

- Personal Interpretation: The writer’s subjective opinions or experiences form the core of the content.
- Limited Fact-Checking: Minimal reliance on verified data; opinions may be framed as personal reflections or “takes.”
- Editorial or Opinion Section: Typically appears in editorial pages, op-eds, blogs, or similar formats clearly labeled as opinion.

##### Label If:

- The text is primarily an opinion piece discussing how the author feels about an event, topic, or policy.
- The author uses subjective language (e.g., “I believe…,” “In my view…”) rather than objective reporting.

##### Do Not Label If:

- The commentary deliberately misrepresents facts to persuade or manipulates partial truths (label as Polarised).
- The commentary is disguised marketing or propaganda with a clear persuasive goal (label as Persuasive).

##### Sources:
- https://www.wsws.org/en/topics/site_area/perspectives
- https://www.huffpost.com/section/opinion
- https://www.nytimes.com/international/section/opinion
- https://www.washingtonpost.com/opinions/
- https://www.theguardian.com/uk/commentisfree
- https://www.nature.com/nature/articles?type=editorial

In [None]:
#https://www.washingtonexaminer.com/opinion/columnists/3318785/complicated-story-iron-mountain/

urls = [
    'https://www.washingtonexaminer.com/opinion/columnists/3311520/trump-rides-the-vibes/',
    'https://www.washingtonexaminer.com/opinion/beltway-confidential/3311291/ignore-trumps-gaza-distraction-focus-iran/',
    'https://www.washingtonexaminer.com/restoring-america/faith-freedom-self-reliance/3311173/how-trump-and-doge-should-reform-social-security-administration/',
    'https://www.washingtonexaminer.com/opinion/beltway-confidential/3154264/universities-need-more-change-than-neutrality/',
    'https://www.washingtonexaminer.com/op-eds/3311024/rubio-must-reverse-the-biden-administrations-designations-of-us-allies/',
    'https://www.washingtonexaminer.com/opinion/editorials/3310152/rubio-good-start-blunting-chinese-influence-panama/',
    'https://www.washingtonexaminer.com/opinion/beltway-confidential/3310965/trump-endangers-american-interests-gaza-ownership-plan/',
    'https://www.washingtonexaminer.com/opinion/3310524/early-look-virginia-governor-race/',
    'https://www.washingtonexaminer.com/opinion/beltway-confidential/3310112/trump-shows-tariffs-work/',
    'https://www.washingtonexaminer.com/restoring-america/patriotism-unity/3310394/ending-plunder-grift-usaid/',
    'https://www.washingtonexaminer.com/opinion/beltway-confidential/3310324/democrats-abortion-cult-getting-more-morbid/',
    'https://www.washingtonexaminer.com/opinion/3310165/sean-parnell-once-again-answers-call-serve/',
    'https://www.washingtonexaminer.com/in_focus/3309253/trump-can-reset-relations-iran-mullah-regime/',
    'https://www.washingtonexaminer.com/opinion/3309085/trump-can-help-prevent-aviation-disasters/',
    'https://www.washingtonexaminer.com/opinion/beltway-confidential/3309352/partners-can-help-us-thwart-russia-iran-axis/',
    'https://www.washingtonexaminer.com/restoring-america/fairness-justice/3309410/violent-climate-action-not-free-speech/',
    'https://www.washingtonexaminer.com/opinion/3322085/trump-should-help-save-ss-united-states/',
    'https://www.washingtonexaminer.com/restoring-america/courage-strength-optimism/3321134/trump-latin-america-realignment-puts-america-first/',
    'https://www.washingtonexaminer.com/restoring-america/faith-freedom-self-reliance/3319748/democrats-backed-doge-until-trump-took-over/',
    'https://www.washingtonexaminer.com/op-eds/3321081/trump-should-call-new-elections-georgia/',
    'https://www.washingtonexaminer.com/restoring-america/fairness-justice/3315981/trump-should-follow-florida-virginia-models-criminal-justice-reform/',
    'https://www.washingtonexaminer.com/opinion/3313825/doge-root-american-values/',
    'https://www.washingtonexaminer.com/restoring-america/community-family/3312627/kelly-loeffler-is-the-champion-small-businesses-need/',
    'https://www.washingtonexaminer.com/restoring-america/community-family/3310424/halt-fentanyl-act-gives-americans-hope/',
    'https://www.washingtonexaminer.com/restoring-america/patriotism-unity/3309280/new-day-border-security-rule-of-law/',
    'https://www.washingtonexaminer.com/opinion/beltway-confidential/3321972/conception-begins-at-erection-absurd-theater-left/',
    'https://www.washingtonexaminer.com/restoring-america/community-family/3321936/problem-artificial-womb/',
    'https://www.washingtonexaminer.com/opinion/editorials/3321826/europe-cannot-handle-truth-free-speech/',
    "https://www.washingtonexaminer.com/opinion/columnists/3318785/complicated-story-iron-mountain/",
    "https://www.washingtonexaminer.com/opinion/beltway-confidential/3318406/disney-families-over-politics/",
    "https://www.washingtonexaminer.com/opinion/columnists/3318333/why-were-hopes-1990s-dashed/",
    "https://www.washingtonexaminer.com/in_focus/3318161/many-problems-trump-gaza-plan/",
    "https://www.washingtonexaminer.com/restoring-america/3317773/trump-transgender-order-win-for-religious-liberty/",
    "https://www.washingtonexaminer.com/opinion/beltway-confidential/3317606/journalists-discover-the-constitution/",
    "https://www.washingtonexaminer.com/restoring-america/faith-freedom-self-reliance/3317637/lori-chavez-deremers-pro-union-stance-makes-her-poor-choice-labor-secretary/",
    "https://www.washingtonexaminer.com/opinion/beltway-confidential/3317675/democrats-activist-judges-against-democracy/",
    "https://www.washingtonexaminer.com/restoring-america/patriotism-unity/3317626/fort-bragg-latest-name-change-pure-political-appeasement/",
    "https://www.washingtonexaminer.com/opinion/beltway-confidential/3317577/trump-must-stop-uks-dangerous-surrender-chagos-islands/",
    "https://www.washingtonexaminer.com/restoring-america/fairness-justice/3317053/shein-and-temu-must-be-restricted-over-slave-labor/",
    "https://www.washingtonexaminer.com/restoring-america/courage-strength-optimism/3316984/wed-texas-leads-in-securing-the-border/",
    "https://www.washingtonexaminer.com/opinion/editorials/3317408/trump-right-on-birthright-citizenship/",
    "https://www.washingtonexaminer.com/in_focus/3316988/super-bowl-rings-in-return-americana/",
    "https://www.washingtonexaminer.com/opinion/beltway-confidential/3317019/war-plastic-straws-highlights-democrats-shoddy-science/",
    "https://www.washingtonexaminer.com/opinion/columnists/3317377/beloved-pennsylvanian-returns-home-russian-prison/",
    "https://www.washingtonexaminer.com/restoring-america/fairness-justice/3317026/trump-global-golden-age-religious-liberty/",
    "https://www.washingtonexaminer.com/daily-memo/3317404/biden-economic-hangover/",
    "https://www.washingtonexaminer.com/opinion/columnists/3315442/constitutional-crisis-blame-democrats/",
    "https://www.washingtonexaminer.com/opinion/columnists/3317163/dumbest-immigration-policy-in-world/",
    "https://www.washingtonexaminer.com/opinion/beltway-confidential/3316191/trump-runs-defense-deep-state-mark-zaid-clearance-revocation/",
    "https://www.washingtonexaminer.com/opinion/3316603/jd-vance-trip-shows-confidence-tulsi-gabbard-rfk-jr-confirmation/",  
]

len(urls)
len(set(urls))

def scrape_washexam_article(url):
    """
    Scrapes an article from a given URL on washingtonexaminer.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Commentary", 
        "url": url
    }

    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        #print(soup)
        
        # Title
        title_meta = soup.find('meta', property='og:title')
        title = title_meta['content'] if title_meta else "Title not found"
        if " - Washington Examiner" in title:
            title = title.replace(" - Washington Examiner", "").strip()
        article_data["title"] = title
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        site = site_name_meta['content'] if site_name_meta else "Site name not found"
        site = site.split(" - ")[0].strip()  # keep only the first part
        article_data["site"] = site
        
        # Published date (from the meta tag)
        pub_date = soup.find("meta", property="article:published_time")
        if pub_date:
            article_data["date"] = pub_date.get("content", "").strip()
        else:
            article_data["date"] = "Date not found"
        
        article_body = soup.find("div", class_="td-post-content")
        if article_body:
            # Remove all <figure> elements.
            for figure in article_body.find_all("figure"):
                figure.decompose()
            # Remove any <a> tag whose text starts with the unwanted phrase.
            for a in article_body.find_all("a"):
                a_text = a.get_text(strip=True)
                if re.match(r"^click\s+here\s+to\s+read\s+more\s+from", a_text, flags=re.IGNORECASE):
                    a.decompose()
            # Extract the text, using a space as separator.
            raw_text = article_body.get_text(separator=" ", strip=True)
            # Replace multiple whitespace/newlines with a single space.
            cleaned_text = re.sub(r'\s+', ' ', raw_text)
            article_data["text"] = cleaned_text
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

# Provide a common browser user agent - otherwise the scraping fails
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
    )
}
# Scrape articles and create a DataFrame
washexam_data_df = scrape_multiple_articles(urls, scrape_washexam_article)
# Store to CSV
washexam_data_df.to_csv("commentary_scraped_articles_washexam.csv", index=False)
# Print df 
washexam_data_df

In [None]:
urls = ['https://www.nature.com/articles/d41586-025-00402-x',
        'https://www.nature.com/articles/d41586-025-00282-1',
        'https://www.nature.com/articles/d41586-025-00283-0',
        'https://www.nature.com/articles/d41586-025-00214-z',
        'https://www.nature.com/articles/d41586-025-00213-0',
        'https://www.nature.com/articles/d41586-025-00150-y',
        'https://www.nature.com/articles/d41586-025-00050-1',
        'https://www.nature.com/articles/d41586-025-00049-8',
        'https://www.nature.com/articles/d41586-025-00014-5',
        'https://www.nature.com/articles/d41586-025-00015-4',
        'https://www.nature.com/articles/d41586-024-04159-7',
        'https://www.nature.com/articles/d41586-024-04114-6',
        'https://www.nature.com/articles/d41586-024-04113-7',
        'https://www.nature.com/articles/d41586-024-04046-1',
        'https://www.nature.com/articles/d41586-024-03911-3',
        'https://www.nature.com/articles/d41586-024-03910-4',
        'https://www.nature.com/articles/d41586-024-03932-y',
        'https://www.nature.com/articles/d41586-024-03843-y',
        'https://www.nature.com/articles/d41586-024-03842-z',
        'https://www.nature.com/articles/d41586-024-03753-z',
        'https://www.nature.com/articles/d41586-024-03673-y',
        'https://www.nature.com/articles/d41586-024-03648-z',
        'https://www.nature.com/articles/d41586-024-03585-x',
        'https://www.nature.com/articles/d41586-024-03485-0',
        'https://www.nature.com/articles/d41586-024-03417-y',
        'https://www.nature.com/articles/d41586-024-03418-x',
        'https://www.nature.com/articles/d41586-024-03331-3',
        'https://www.nature.com/articles/d41586-024-03332-2',
        'https://www.nature.com/articles/d41586-024-03266-9',
        'https://www.nature.com/articles/d41586-024-03267-8',
        'https://www.nature.com/articles/d41586-024-03182-y',
        'https://www.nature.com/articles/d41586-024-03183-x',
        'https://www.nature.com/articles/d41586-024-03109-7',
        'https://www.nature.com/articles/d41586-024-03110-0',
        'https://www.nature.com/articles/d41586-024-02992-4',
        'https://www.nature.com/articles/d41586-024-02991-5',
        'https://www.nature.com/articles/d41586-024-02912-6',
        'https://www.nature.com/articles/d41586-024-02913-5',
        'https://www.nature.com/articles/d41586-024-02828-1',
        'https://www.nature.com/articles/d41586-024-02829-0',
        'https://www.nature.com/articles/d41586-024-02757-z',
        'https://www.nature.com/articles/d41586-024-02673-2',
        'https://www.nature.com/articles/d41586-024-02600-5',
        'https://www.nature.com/articles/d41586-024-02533-z',
        'https://www.nature.com/articles/d41586-024-02445-y',
        'https://www.nature.com/articles/d41586-024-02381-x',
        'https://www.nature.com/articles/d41586-024-02314-8',
        'https://www.nature.com/articles/d41586-024-02224-9',
        'https://www.nature.com/articles/d41586-024-02169-z',
        'https://www.nature.com/articles/d41586-024-02080-7'
       ]

def scrape_nat_article(url):
    """
    Scrapes an article from a given URL on nature.com and extracts relevant information.
    Extracts title, text, site, and published date.
    For site name, it falls back to the twitter:site meta tag if og:site_name is missing.
    For date, it attempts to extract from meta tag "article:published_time" first, then JSON-LD.
    For text, it looks for <div class="c-article-body">.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Commentary", 
        "url": url
    }
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": "https://www.google.com/"
    }
    
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        #print (soup)
        # Title: use og:title or fallback to the <title> tag.
        title_meta = soup.find('meta', property='og:title')
        title = title_meta['content'] if title_meta else (soup.title.string if soup.title else "Title not found")
        article_data["title"] = title
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url
        
        # Site name: try og:site_name first; if missing, use twitter:site.
        site_name_meta = soup.find('meta', property='og:site_name')
        if site_name_meta:
            site = site_name_meta['content']
        else:
            twitter_site = soup.find('meta', attrs={'name': 'twitter:site'})
            if twitter_site:
                site = twitter_site['content']
                if site.startswith("@"):
                    site = site[1:]
            else:
                site = "Site name not found"
        article_data["site"] = site
        
        # Published date: first try meta tag; then fall back to JSON-LD.
        pub_date_meta = soup.find("meta", property="article:published_time")
        if pub_date_meta:
            article_data["date"] = pub_date_meta.get("content", "").strip()
        else:
            ld_script = soup.find("script", type="application/ld+json")
            if ld_script:
                try:
                    ld_json = json.loads(ld_script.string)
                    # Sometimes ld+json is a list of objects
                    if isinstance(ld_json, list):
                        ld_json = ld_json[0]
                    if "mainEntity" in ld_json and "datePublished" in ld_json["mainEntity"]:
                        article_data["date"] = ld_json["mainEntity"]["datePublished"]
                    elif "datePublished" in ld_json:
                        article_data["date"] = ld_json["datePublished"]
                    else:
                        article_data["date"] = "Date not found"
                except Exception:
                    article_data["date"] = "Date not found"
            else:
                article_data["date"] = "Date not found"
        
        # Category: extract from <li data-test="article-category">
        cat_li = soup.find('li', attrs={'data-test': 'article-category'})
        if cat_li:
            cat_span = cat_li.find('span', class_='c-article-identifiers__type')
            if cat_span:
                article_data["category"] = cat_span.get_text(strip=True)
            else:
                article_data["category"] = "Category not found"
        else:
            article_data["category"] = "Category not found"
        
        # Article body: look for the div with class "c-article-body"
        article_body = soup.find("div", class_=lambda c: c and "c-article-body" in c)
        if article_body:
            # Remove all <figure> elements.
            for figure in article_body.find_all("figure"):
                figure.decompose()
            
            # Also remove header title and teaser text from within article_body (if present).
            header_title_elem = article_body.find("h1", class_="c-article-magazine-title")
            if header_title_elem:
                header_title_elem.decompose()
            teaser_elem = article_body.find("div", class_="c-article-teaser-text")
            if teaser_elem:
                teaser_elem.decompose()
            
            # Now, gather the remaining paragraphs.
            paragraphs = article_body.find_all("p")
            # Also fetch header title and teaser from outside the article body as skip markers.
            header_title = ""
            teaser_text = ""
            ext_header = soup.find("h1", class_="c-article-magazine-title")
            if ext_header:
                header_title = ext_header.get_text(strip=True).lower()
            ext_teaser = soup.find("div", class_="c-article-teaser-text")
            if ext_teaser:
                teaser_text = ext_teaser.get_text(strip=True).lower()
            
            article_paragraphs = []
            for p in paragraphs:
                p_text = p.get_text(separator=" ", strip=True)
                lower_text = p_text.lower()
                # Skip paragraphs that contain the header title or teaser text
                if header_title and header_title in lower_text:
                    continue
                if teaser_text and teaser_text in lower_text:
                    continue
                article_paragraphs.append(p_text)
            
            # Join paragraphs and clean extra whitespace.
            article_text = " ".join(article_paragraphs)
            cleaned_text = re.sub(r'\s+', ' ', article_text)
            article_data["text"] = cleaned_text
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

# Provide a common browser user agent - otherwise the scraping fails
headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": "https://www.google.com/"
}
# Scrape articles and create a DataFrame
nat_data_df = scrape_multiple_articles(urls, scrape_nat_article)
# Store to CSV
nat_data_df.to_csv("commentary_scraped_articles_nat.csv", index=False)
# Print df 
nat_data_df

In [None]:
urls = ["https://www.rollingstone.com/politics/politics-features/baltimore-sun-right-wing-takeover-david-smith-1235268329/"]

def scrape_stone_article(url):
    """
    Scrapes an article from a given URL on rollingstone.com and extracts relevant information.
    Extracts title, text, site, and published date.
    For site name, it falls back to the twitter:site meta tag if og:site_name is missing.
    For date, it attempts to extract from meta tag "article:published_time" first, then JSON-LD.
    For text, it looks for <div class="c-article-body">.
    """
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/115.0.0.0 Safari/537.36"
        )
    }
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return {"error": "Failed to retrieve the page"}
    
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Remove advertisement blocks if present
    for ad in soup.find_all("div", class_="admz"):
        ad.decompose()

    # Locate the container that holds the article body.
    article_container = soup.find("div", class_="pmc-paywall")
    if not article_container:
        return {"error": "Article container not found"}
    
    # Find all paragraphs that include the article text.
    paragraphs = article_container.find_all("p", class_=lambda x: x and "paragraph" in x)
    
    # Join paragraphs together, preserving some separation.
    article_text = "\n\n".join(p.get_text(separator=" ", strip=True) for p in paragraphs)
    print("!Huzzah")
    print(article_text)
    
    # Extract other metadata if needed
    title_tag = soup.find("meta", property="og:title")
    title = title_tag["content"] if title_tag and title_tag.has_attr("content") else "Title not found"
    
    published_tag = soup.find("meta", property="article:published_time")
    published_date = published_tag["content"] if published_tag and published_tag.has_attr("content") else "Date not found"
    
    # Return a dictionary with the article details
    return {
        "title": title,
        "text": article_text,
        "date": published_date,
        "url": url
    }


def scrape_multiple_stone_articles(urls):
    """
    Scrapes multiple articles from a list of URLs and stores the data in a DataFrame.

    Parameters:
    ----------
    urls : list
        A list of article URLs to scrape.

    Returns:
    -------
    pd.DataFrame
        A DataFrame containing the scraped data from all URLs.
    """
    articles = []
    for url in urls:
        article = scrape_stone_article(url)
        articles.append(article)
    return pd.DataFrame(articles)

# Provide a common browser user agent - otherwise the scraping fails
headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": "https://www.google.com/"
}
# Scrape articles and create a DataFrame
stone_data_df = scrape_multiple_stone_articles(urls)
# Store to CSV
stone_data_df.to_csv("commentary_scraped_articles_stone.csv", index=False)
# Print df 
stone_data_df