# TruthLens - Data Collection

TruthLens is a project developed for the BSc. Computer Science (Data Science) Final Project (CM3070) at the University of London. TruthLens is based on the Fake News Detection template. 

## Project Objectives
The primary objective of this project is to build a two-stage pipeline for misinformation classification:

1. Binary classification (Stage 1): Distinguish between real news and misinformation using the ISOT dataset. This ensures robust detection at the first stage, leveraging an established dataset.
2. Multi-class classification (Stage 2): Further classify content identified as misinformation into one of five categories, based on an adaptation of Molina et al.’s taxonomy. A custom dataset will support this nuanced classification.

The scope of the project is limited to text-based, English language content, explicitly excluding images and videos. A user interface will also be developed, enabling users to input articles or URLs and receive classification results.

A secondary objective is to enhance the explainability of classification results, aiming to provide users with interpretable insights into why content was classified in a particular way.

The project aims for high accuracy and reliability, with measurable performance goals. Ethical considerations, including bias mitigation and responsible dataset usage, will guide the design and implementation of the pipeline.

## Custom dataset generation
As outlined in the previous section, the second stage of the pipeline relies upon a custom dataset, labelled with the categories from the Molina et al. Misinformation Taxonomy. These classes are summarised in the table below. The aim of this stage is to create a balanced dataset with 200 pieces of content for each of the 7 categories. 

| Misinformation Type | Characteristics | Example |
|:--------------|:---------------|:-------|
| Fabricated content | Completely false content created with the intent to deceive.| Fake reports of events that never occurred; entirely false claims about public figures |
|Polarised content |True events or facts presented selectively to promote a biased narrative, often omitting critical context. |Partisan news articles highlighting one side of a political argument while ignoring counterpoints.|
|Satire |Content intended to entertain or provoke thought through humour, exaggeration, or irony. Often misunderstood. |Satirical articles from outlets like “The Onion” being shared as if they are factual news.|
|Misreporting | Incorrect information shared unintentionally, often due to errors or lack of verification. | A news outlet incorrectly reporting election results due to early or inaccurate data.|
|Commentary |Opinion-based content reflecting the writer’s interpretation or viewpoint, often lacking factual grounding. |Editorials or blogs expressing subjective opinions without substantial evidence.|
|Persuasive information |Content designed to persuade or influence the audience, often including marketing and propaganda. |Politically motivated propaganda campaigns, advertisements disguised as objective news articles.|
|Citizen journalism | User-generated content that may lack professional journalistic standards, leading to error or bias. |Social media posts about breaking news that spread unverified or incorrect details.|

Data will be scraped from relevant websites for each category, then manually reviewed to ensure that it fits the category. Relevant features and labelling guidelines for each category are given below.
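
The 200-per-category target can be checked mechanically once the data is assembled. The following is a minimal sketch, assuming the combined dataset carries a `class` column like the fabricated-content dataset built later in this notebook. Only the label `fabricated` appears in the notebook; the other label strings and the helper name `check_balance` are hypothetical.

```python
import pandas as pd

TARGET_PER_CLASS = 200  # target stated in the section above

# Only "fabricated" is taken from the notebook; the other label strings are assumed.
EXPECTED_CLASSES = [
    "fabricated", "polarised", "satire", "misreporting",
    "commentary", "persuasive", "citizen_journalism",
]


def check_balance(df, target=TARGET_PER_CLASS):
    """Return per-class counts and the shortfall against the target."""
    counts = df["class"].value_counts().reindex(EXPECTED_CLASSES, fill_value=0)
    shortfall = (target - counts).clip(lower=0)
    return pd.DataFrame({"count": counts, "shortfall": shortfall})


# Toy data: one class complete, one partly collected, the rest empty.
toy = pd.DataFrame({"class": ["fabricated"] * 200 + ["satire"] * 150})
report = check_balance(toy)
print(report)
```

Reindexing against the expected label list means that a category with no rows yet still shows up in the report with a count of zero, rather than silently disappearing.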

In [1]:
#Imports and helper functions
import requests
import json
from bs4 import BeautifulSoup
import csv
import pandas as pd
pd.set_option('display.max_colwidth', 250)
import re
import string
import nltk
# preprocess_text needs the NLTK stopwords corpus: run nltk.download('stopwords') once beforehand
from nltk.corpus import stopwords
from datetime import datetime
import unicodedata

def preprocess_text(text):
    """
        Preprocesses a given text string by applying the following steps:
        1. Converts the text to lowercase.
        2. Removes punctuation marks.
        3. Tokenizes the text into individual words.
        4. Removes stopwords (common words that add little value to classification tasks).

        Parameters:
        ----------
        text : str
            The input text string to preprocess.

        Returns:
        -------
        str
            The cleaned and preprocessed text, with tokens joined back into a single string.
    """
    stop_words = set(stopwords.words('english'))
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = [word for word in text.split() if word not in stop_words]
    return ' '.join(tokens)

def scrape_multiple_articles(urls, scrape_function):
    """
    Scrapes multiple articles from a list of URLs and stores the data in a DataFrame.

    Parameters:
    ----------
    urls : list
        A list of article URLs to scrape.
    scrape_function : callable
        The function used to scrape a single URL. It is called once per URL
        and should return one record (e.g. a dict) for that article.

    Returns:
    -------
    pd.DataFrame
        A DataFrame containing the scraped data from all URLs.
    """
    articles = []
    for url in urls:
        article = scrape_function(url)
        articles.append(article)
    return pd.DataFrame(articles)

def clean_text(text):
    """
    Normalize unicode characters to NFKC, replace newlines with spaces,
    and collapse runs of whitespace. Non-string input is returned unchanged.
    """
    # Make sure input is a string
    if not isinstance(text, str):
        return text
    
    # Normalize to NFKC (to convert the weird Unicode math symbols)
    text = unicodedata.normalize("NFKC", text)
    
    # Replace newline characters with a space
    text = text.replace("\n", " ")
    
    # Remove any extra whitespace
    text = " ".join(text.split())
        
    return text
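
The helpers above can be exercised without touching the network. The following is a minimal sketch: `clean_text` and `scrape_multiple_articles` are restated in compact form so the block runs on its own, and `stub_scraper` is a hypothetical stand-in for a real per-site scraper that returns canned text instead of fetching the URL.

```python
import unicodedata

import pandas as pd


def clean_text(text):
    """Restated from the helper cell so this sketch runs on its own."""
    if not isinstance(text, str):
        return text
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.replace("\n", " ").split())


def scrape_multiple_articles(urls, scrape_function):
    """Restated from the helper cell: one row per scraped URL."""
    return pd.DataFrame([scrape_function(url) for url in urls])


def stub_scraper(url):
    """Hypothetical scraper: returns canned text instead of fetching the URL."""
    # Full-width "B", an "fi" ligature, a newline, repeated spaces and a
    # non-breaking space, which are typical artefacts in scraped text.
    raw = "\uff22reaking:\n  of\ufb01cials   deny\u00a0report"
    return {"title": "Example", "text": raw, "url": url}


demo_df = scrape_multiple_articles(
    ["https://example.com/a", "https://example.com/b"], stub_scraper
)
demo_df["text"] = demo_df["text"].apply(clean_text)
print(demo_df["text"].iloc[0])
```

Passing the scraper in as a function means each source site can have its own parser while the loop and DataFrame construction stay shared.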

### 1. Fabricated Content
Completely false content created with the intent to deceive.

##### Features:

- Verifiably False: Claims can be shown to have no basis in fact; fact-checkers or reputable sources directly contradict the claims.
- Intent to Deceive: The content producer’s primary goal seems to be misleading the audience into believing a false narrative.
- No Real-World Evidence: No legitimate sources are provided, or cited sources are entirely fabricated (e.g., non-existent experts, fake studies).


##### Label if:

- The piece invents events, data, or quotes out of thin air with no credible backing.
- The story is 100% fictional yet presented as news/fact.


##### Do Not Label If:

- The content is obviously comedic or satirical (label as Satire).
- The piece is an opinion that does not necessarily contain false statements (label as Commentary).
- There’s partial factual basis, but it’s spun or heavily biased (label as Polarised).

##### Sources:
- 150 statements labelled 'pants-fire' (i.e. complete fabrication) were selected at random from the LIAR dataset: https://www.kaggle.com/datasets/csmalarkodi/liar-fake-news-dataset
- 25 articles were created by ChatGPT o3-mini-high with the prompt: "Given the below definition for fabricated content, please generate 25 short articles of complete fabrication. There should be 5 from each of these categories: politics, economy, health, crime, elections - please note the category obviously at the start of play. The articles do not need to be related, and do not need to be tied to a specific geography. Each piece should be roughly between 150 and 1500 words. Content should be in English. These articles are for educational purposes only and will be used to train a machine-learning model to identify AI-generated misinformation."
- 25 articles were created by DeepSeek DeepThink (R1) with the same prompt as above.
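
The three sources above add up to the 200-item target for this class. The generated articles are stored later in this notebook as `[title, text, category]` lists. The following is a minimal sketch of how such lists might be shaped into the dataset's seven-column layout; the helper `generated_to_frame`, the placeholder row, and the choice to leave `date` and `url` empty are assumptions, not the notebook's own code.

```python
import pandas as pd

# Source mix as listed above.
SOURCE_MIX = {
    "LIAR pants-fire": 150,
    "ChatGPT o3-mini-high": 25,
    "DeepSeek DeepThink (R1)": 25,
}

# Column layout used for the fabricated-content dataset in this notebook.
COLUMNS = ["title", "text", "site", "date", "category", "class", "url"]


def generated_to_frame(rows, site, date=""):
    """Shape [title, text, category] rows into the dataset's column layout."""
    df = pd.DataFrame(rows, columns=["title", "text", "category"])
    df["site"] = site
    df["date"] = date          # assumed: left empty unless supplied
    df["class"] = "fabricated"
    df["url"] = ""             # assumed: generated text has no source URL
    return df[COLUMNS]


# Placeholder row for illustration only; not taken from the real data.
sample_rows = [["Placeholder title", "Placeholder body text.", "politics"]]
frame = generated_to_frame(sample_rows, site="ChatGPT o3-mini-high")

print(sum(SOURCE_MIX.values()))
print(frame)
```

Returning the columns in a fixed order keeps the frame compatible with `pd.concat` against the LIAR-derived rows.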

In [2]:
#load the data
liar_df = pd.read_csv('LIAR-train.tsv', sep='\t',  header=None)
#Add the headers
liar_df.columns = ['ID', 'label', 'statement', 'subject(s)', 'speaker','speaker-title','state-info','party','barely-true-count','false','half-true','mostly-true','pants-fire','context']  
#Count labels
label_counts = liar_df['label'].value_counts(dropna=False)
print(label_counts)
#filter dataset to just pants-fire content
pants_fire_df = liar_df[liar_df['label'] == 'pants-fire']
#randomly select 150 rows (a fixed random_state makes the sample reproducible)
pants_fire_sample = pants_fire_df.sample(n=150, random_state=42)
pants_fire_sample = pants_fire_sample[['statement','subject(s)']]
#make a copy to avoid the SettingWithCopy warning.
pants_fire_sample = pants_fire_sample.copy()
#Just take the first subject, and swap dashes with spaces
pants_fire_sample['subject(s)'] = pants_fire_sample['subject(s)'].str.split(',').str[0].str.replace('-', ' ')
#reset index
pants_fire_sample = pants_fire_sample.reset_index(drop=True)
#Display the head
#print(pants_fire_sample.head())
#Create empty dataset for fabricated content
columns = ['title', 'text', 'site', 'date', 'category', 'class', 'url']
fabricated_dataset = pd.DataFrame(columns=columns)
#prepare the LIAR data for the new df
temp_df = pd.DataFrame({
    'title': "",  
    'text': pants_fire_sample['statement'],
    'site': "Liar Database",  
    'date': "February 4th",  
    'category': pants_fire_sample['subject(s)'], 
    'class': "fabricated",
    'url': "https://www.kaggle.com/datasets/csmalarkodi/liar-fake-news-dataset"
})
fabricated_dataset = pd.concat([fabricated_dataset, temp_df], ignore_index=True)
print(fabricated_dataset.head())

half-true      2114
false          1995
mostly-true    1962
true           1676
barely-true    1654
pants-fire      839
Name: label, dtype: int64
  title  \
0         
1         
2         
3         
4         

                                                                                                                                        text  \
0                                                                      Ed Perlmutter voted for Viagra for rapists paid for with tax dollars.   
1                                                        More than 3,000 homicides were committed by illegal aliens over the past six years.   
2  (U.S. Reps.) Paul Ryan, Sean Duffy and Reid Ribble are shutting down town hall meetings, or making their constituents pay to attend them.   
3                                                   Earlier this year, [John DePetro] was accused of sexually assaulting a female co-worker.   
4                                                                  
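
The filter-then-sample step in the cell above can be wrapped in a small helper that fails with a clear message when a label has too few rows. The following is a minimal sketch on toy data; `sample_label` is a hypothetical helper, and the toy statements are placeholders rather than LIAR content.

```python
import pandas as pd


def sample_label(df, label, n, seed=42):
    """Return n randomly chosen rows with the given label, reproducibly."""
    subset = df[df["label"] == label]
    if len(subset) < n:
        raise ValueError(f"only {len(subset)} rows labelled {label!r}; need {n}")
    # A fixed seed gives the same rows on every run.
    return subset.sample(n=n, random_state=seed).reset_index(drop=True)


# Toy stand-in for the LIAR frame: 10 'pants-fire' rows and 10 'true' rows.
toy = pd.DataFrame({
    "label": ["pants-fire", "true"] * 10,
    "statement": [f"statement {i}" for i in range(20)],
})

first = sample_label(toy, "pants-fire", n=5)
second = sample_label(toy, "pants-fire", n=5)
print(first.equals(second))
```

With 839 'pants-fire' rows in the label counts above, a request for 150 is well within range, but the guard becomes useful if the same helper is reused on smaller sources.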

In [3]:
#ChatGPT output
chatgpt_output = [
    ['Shadow Council Manipulates Global Policies','In a stunning revelation that has rocked the global political landscape, insiders have claimed that a secretive group known as the Shadow Council has been orchestrating international policy decisions behind the scenes for over two decades. According to anonymous sources within high-ranking government agencies, this clandestine network meets in undisclosed locations to determine the fate of nations—manipulating economic strategies, military deployments, and diplomatic relations with ruthless precision. One whistleblower, insisting on anonymity, described the council’s gatherings as “a blend of high-level intrigue and covert power plays,” where a handful of elite figures shape world events. Despite a complete lack of verifiable evidence and rebuttals from reputable fact-checkers, rumors persist, stirring suspicion among citizens and igniting fierce debates over the true nature of global governance. Critics demand full transparency, while supporters dismiss the claims as a political witch hunt.','politics'],
    ['The Rise of the Phantom Leader','Reports from undisclosed insiders claim that a mysterious figure—referred to only as the Phantom Leader—has quietly assumed control over several national governments simultaneously. Allegedly emerging from the shadows of political instability, this enigmatic leader is said to have orchestrated a series of covert meetings with influential policymakers in dimly lit back rooms. Documents leaked to a dubious online forum (purportedly authored by “deep-state informants”) suggest that the Phantom Leader’s network manipulates legislative agendas and even directs covert military operations without public knowledge. Despite lacking any credible sources, conspiracy theorists assert that this figure’s influence is so pervasive that major policy shifts and election outcomes across multiple continents can be traced back to secret communications from this single mastermind. Authorities have repeatedly denied any such existence, dismissing the reports as politically motivated fabrications. Nonetheless, the legend of the Phantom Leader continues to fuel debates on the hidden forces controlling modern politics.','politics'],
    ['Fabricated Faction’s Covert Conspiracy Exposed', 'A series of anonymous memos circulating on obscure internet forums have allegedly uncovered a covert conspiracy orchestrated by a fabricated political faction known as the “Crimson Syndicate.” According to these unverified documents, the Crimson Syndicate comprises influential lawmakers and shadowy advisors who purportedly manipulate policy decisions for their own benefit. The memos detail clandestine meetings held in remote, undisclosed locations where members allegedly decide on major legislative actions and orchestrate political scandals to discredit rivals. One particularly detailed memo claims that the Syndicate once arranged the downfall of an entire government cabinet simply to advance its own secret agenda. While no reputable news outlet or independent fact-checker has confirmed any part of this narrative, the circulating documents have nevertheless sparked heated discussions on social media and among fringe political groups. Detractors dismiss the allegations as wild fabrications, yet the growing fascination with the Crimson Syndicate continues to captivate those eager to believe in hidden, all-powerful networks in the realm of politics.','politics'],
    ["Hidden Alliances in the Corridors of Power", "In a narrative that sounds more like a spy thriller than reality, leaked “insider” communications now allege the existence of hidden alliances among top government officials across multiple nations. According to these fabricated sources, secret meetings held in luxurious, undisclosed locations have resulted in a series of backdoor pacts designed to bypass democratic processes. The documents—a mixture of blurry photographs, cryptic emails, and questionable “eyewitness” accounts—claim that leaders from different countries conspire to ensure their mutual benefit, often at the expense of public welfare. One source, identified only by the pseudonym “Nightwatcher,” asserts that these covert gatherings have influenced major global events, including trade wars and military escalations, with no oversight or accountability. Critics argue that the evidence is entirely manufactured, yet the tale of clandestine pacts behind closed doors continues to circulate widely, feeding the narrative that true power resides not in publicly elected officials but in secret alliances hidden in the corridors of power.",'politics'],
    ['Government Secrets Unveiled by Whistleblowers','A series of explosive revelations by alleged whistleblowers has ignited controversy in political circles, with claims that top government officials have been concealing vast amounts of classified information from the public. According to the fabricated reports, these officials have engaged in a deliberate cover-up involving the manipulation of policy outcomes, the redirection of public funds, and the orchestration of international incidents to distract from domestic mismanagement. Leaked documents—purportedly obtained through highly secretive channels—purport to show that covert committees operate independently of elected representatives, making decisions that affect millions without any form of public scrutiny. One anonymous source claimed that a secret “Transparency Committee” exists solely to fabricate narratives that support the government’s agenda. Although no hard evidence has emerged and fact-checkers have thoroughly debunked the claims, the idea of hidden governmental secrets continues to resonate with a segment of the population that remains deeply distrustful of official narratives.','politics'],
    ["The Secret Currency That Could Change the World","In a story that has captured the imaginations of economic conspiracy theorists everywhere, unverified sources have alleged the existence of a hidden global currency engineered by an elite cabal. Dubbed the “Phantom Coin,” this secret form of money is said to circulate only among the world’s most powerful financial institutions, outside the purview of national regulators and international oversight. According to the fabricated narrative, the Phantom Coin was created as a tool to destabilize traditional monetary systems and establish a new world order based on clandestine financial control. Anonymous insiders claim that this digital currency is already in circulation, used to facilitate secret transactions and influence economic policies in various countries. Although mainstream economists and banking authorities have dismissed these assertions as complete fabrications, the idea of a hidden monetary system has fueled heated debates on online forums and in underground economic circles. Critics argue that the concept of a global secret currency is nothing more than a cleverly constructed myth, designed to incite distrust in established financial institutions.","economy"],
    ["Hidden Financial Collapse Engineered by Elites", "A series of unsubstantiated leaks has sent shockwaves through online financial communities, with claims that a shadowy group of financial elites has orchestrated a deliberate plan to trigger a global economic collapse. According to the fabricated documents circulating on encrypted messaging apps, these elites have been manipulating stock markets, interest rates, and international trade agreements for decades. The conspiracy theory posits that by engineering an economic meltdown, this cabal intends to seize control of national economies and install a new financial system under their complete dominion. One anonymous source, signing off as “The Insider,” detailed how secret meetings held in undisclosed locations allegedly laid out a blueprint for the collapse, complete with timelines and specific economic indicators. Despite the lack of any credible evidence or confirmation from reputable institutions, the narrative has taken on a life of its own among conspiracy theorists. Mainstream experts have categorically rejected the theory, but the allure of a hidden hand guiding global economics continues to fascinate and alarm many.","economy"],
    ["The Phantom of Market Manipulation", "Recent reports from mysterious online channels claim that a covert group known as “The Phantom” has been secretly manipulating global stock markets to create artificial booms and busts. According to the entirely fabricated story, this group uses advanced algorithms and insider access to orchestrate dramatic swings in market values, ensuring that only a select few reap enormous profits while ordinary investors suffer severe losses. Leaked “evidence” in the form of blurry screenshots and unverified emails purport to show that major market indices were deliberately skewed during key financial events over the past decade. Conspiracy theorists argue that The Phantom’s actions are responsible for several notorious market crashes, though no reputable financial analyst or regulator has ever confirmed any such scheme. Instead, critics dismiss the allegations as modern folklore—a narrative designed to explain the often unpredictable nature of global finance. Nonetheless, the legend of The Phantom continues to spread across online communities, feeding the belief that the markets are secretly rigged by unseen forces.","economy"],
    ["Underground Trade Networks Revealed", "Whispers of an extensive underground trade network have recently surfaced in a series of online posts that claim to expose an elaborate system of secret deals and backdoor negotiations among multinational corporations and government insiders. According to these unverified accounts, this network—codenamed “Black Route”—is responsible for smuggling vital commodities, manipulating supply chains, and controlling prices on a global scale. Fabricated documents allegedly leaked from an anonymous source suggest that Black Route operates with near-impunity, using encrypted communication channels and hidden financial conduits to bypass international regulations. The posts detail intricate schemes involving fake invoices, shadow accounts, and secret meetings in remote warehouses. Despite the dramatic narrative, established trade experts and economic analysts have refuted the existence of any such network, attributing the claims to baseless rumors and intentional disinformation. Yet the allure of a hidden economic underworld continues to captivate the imaginations of those distrustful of global financial systems, even as authorities dismiss the reports as entirely fictitious.","economy"],
    ["Fake Economic Forecasts Uncovered by Investigative Reporters", "A recently circulated dossier—allegedly compiled by a group of rogue investigative reporters—claims that some of the world’s most prominent economic forecasts are nothing but elaborate fabrications designed to mislead the public and manipulate market sentiment. According to this entirely fabricated report, influential think tanks and financial institutions have conspired to publish optimistic projections despite mounting evidence of economic instability. The dossier asserts that behind the scenes, a secretive committee of experts is altering data and suppressing negative information to maintain investor confidence and secure lucrative financial deals. Interviews quoted in the dossier (all of which are untraceable) describe how internal memos instruct analysts to “spin the narrative” during times of economic downturn. While mainstream economists and reputable media outlets have thoroughly debunked these claims, the narrative has found traction on social media and alternative news platforms. Critics argue that the story is a carefully constructed piece of misinformation aimed at sowing distrust in established economic institutions and their published forecasts.","economy"],
    ["Miracle Cure or Conspiracy? The Hidden Truth", "A bombshell report circulating in underground online communities alleges that a revolutionary “miracle cure” for multiple chronic illnesses has been discovered in a secret laboratory—but that the cure is being deliberately suppressed by powerful pharmaceutical interests. According to the fabricated narrative, researchers at a clandestine facility in an undisclosed location developed a treatment that can reverse conditions ranging from diabetes to autoimmune disorders. Whistleblowers (whose identities remain unverified) claim that multinational drug companies, fearing a catastrophic loss of profits, have conspired to bury the research and discredit its findings. Detailed, though entirely fictional, documents describe covert meetings between executives and government regulators where plans were hatched to discredit the miracle cure through a series of “controlled clinical failures.” Despite the dramatic claims, no reputable medical journal or regulatory agency has ever confirmed the existence of such a treatment. Nonetheless, the story has ignited fervent discussion among alternative health advocates and conspiracy theorists, with many calling for independent investigations into the alleged cover-up.","health"],
    ["Government-Secret Vaccines and the Hidden Agenda", "In a narrative that has rapidly spread through social media channels, unverified sources now claim that several governments have developed secret vaccines—not to combat diseases, but to implant mind-control nanobots in unsuspecting citizens. According to the entirely fabricated account, these covert vaccines were engineered in hidden research facilities and are being distributed covertly alongside routine immunizations. Insiders allege that top government officials have conspired with shadowy biotech firms to implement the program as part of a larger scheme to control public behavior and suppress dissent. Detailed but unverifiable “leaks” include diagrams of nanobot technology and supposed internal memos outlining the project’s phases. Public health authorities and independent scientists have dismissed the claims as absurd and lacking any empirical basis, yet the narrative continues to fuel heated debates online. The story has become a rallying cry for those suspicious of government overreach, even as experts warn that the entire account is a complete fabrication designed to stoke fear and mistrust in established health institutions.","health"],
    ["The Fabricated Epidemic That Never Was","A recent series of posts on fringe health forums has claimed that an epidemic sweeping the globe is nothing more than a carefully orchestrated fabrication by international health agencies. According to these unfounded accounts, the so-called outbreak of a novel virus was deliberately invented to enforce draconian public health measures and expand governmental control over citizens’ lives. Fabricated “data” presented in the posts—including manipulated graphs and fake expert testimonies—purports to show that infection rates were grossly exaggerated and that the virus was engineered in a laboratory as part of a secret experiment. Despite overwhelming evidence to the contrary provided by reputable global health organizations, the narrative has gained traction among communities predisposed to distrust official sources. Detractors of the mainstream narrative argue that the epidemic is a hoax designed to justify unprecedented restrictions on personal freedom. While scientists and public health experts have thoroughly debunked the claims, the rumor of a fabricated epidemic persists as one of the most controversial and persistent conspiracy theories in the health sphere.", "health"],
    ["Shadow Health Organization Controlling Treatments", "A startling claim emerging from anonymous online sources alleges that a clandestine organization, known only as the “Global Health Directorate,” is secretly controlling all aspects of medical research and treatment protocols worldwide. According to this entirely fabricated narrative, the Directorate operates behind the scenes to determine which diseases receive funding for research and which innovative treatments are suppressed to protect the interests of certain pharmaceutical giants. Leaked “internal documents” (all of which have been debunked by experts) supposedly reveal that this shadow group manipulates clinical trial outcomes and deliberately withholds breakthrough therapies from the public. One supposed insider explained that the Directorate’s ultimate goal is to monopolize global healthcare, ensuring that all new treatments funnel profits exclusively to a handful of powerful corporations. While mainstream scientists and healthcare professionals have dismissed these claims as pure fantasy, the idea of a hidden health organization continues to resonate with individuals suspicious of modern medicine and its regulatory framework.","health"],
    ["The Pseudoscientific Breakthrough that Shocked Experts", "A recent online buzz has centered on reports of a pseudoscientific breakthrough—allegedly discovered by a renegade group of researchers—that claims to reverse aging and cure terminal illnesses in a single treatment. According to the fabricated account, the breakthrough involves a complex combination of gene therapy and nanotechnology, developed in a secret laboratory hidden beneath an abandoned industrial complex. The story goes on to assert that leading medical experts worldwide have been silenced or coerced into keeping the discovery under wraps, with influential institutions allegedly colluding to protect lucrative existing treatments. Detailed but entirely spurious “research notes” and blurry laboratory images have been circulated to support the claim. Despite the sensational nature of the announcement, no peer-reviewed studies or independent verifications exist to corroborate the story. Health authorities and renowned scientists have categorically refuted the claims, calling the report a dangerous piece of misinformation designed to exploit public hopes for miraculous cures.","health"],
    ["The Mastermind Behind the Global Heist","In an astonishing tale that sounds straight out of a blockbuster movie, unverified sources have alleged the existence of a criminal mastermind orchestrating a series of sophisticated heists across multiple continents. Dubbed “The Phantom Thief” by underground circles, this enigmatic figure is said to have masterminded daring robberies targeting high-security financial institutions and luxury art galleries alike. According to the fabricated narrative, The Phantom Thief utilizes an intricate network of accomplices and cutting-edge technology to bypass state-of-the-art security systems. Leaked “confidential reports” (entirely unverifiable) claim that the mastermind’s operations are so meticulously planned that law enforcement agencies remain one step behind at every turn. One anonymous tipster described a dramatic scene in which the criminal escaped using an elaborate series of decoys and underground tunnels. Despite widespread media interest and online chatter, no credible evidence supports these claims, and authorities have repeatedly dismissed the story as an elaborate fabrication. Nonetheless, the legend of The Phantom Thief continues to capture the imagination of both criminals and crime enthusiasts.", "crime" ],
    ["The Cyber Syndicate and the Digital Black Market", "A series of posts on dark web forums has recently brought attention to an alleged cyber syndicate that is said to run an expansive digital black market, controlling vast networks of hackers and cybercriminals. According to the entirely fabricated narrative, this syndicate—known only as “Digital Dominion”—is responsible for orchestrating large-scale data breaches, identity thefts, and even orchestrated cyberattacks on critical infrastructure. The story details how Digital Dominion supposedly recruits skilled hackers from around the globe, providing them with state-of-the-art tools and secretive training in return for a share of their illicit profits. Leaked “evidence” in the form of anonymized chat logs and cryptic online transactions has fueled speculation about the syndicate’s influence over modern cybercrime. Despite the dramatic claims, no law enforcement agency has confirmed the existence of such an organization, and cybersecurity experts have dismissed the narrative as a myth designed to instill fear. Nevertheless, the notion of a centralized cybercriminal empire continues to spread rapidly among online communities, adding fuel to debates about digital security.","crime"],
    ["Fake Evidence Links Celebrity to Crime Ring", "A scandalous claim has emerged from questionable online sources alleging that a world-renowned celebrity is secretly involved in an international crime ring. According to the fabricated report, the star—whose identity remains deliberately vague—has been linked through a series of doctored documents, manipulated photographs, and untraceable phone recordings to an underground network involved in money laundering and arms trafficking. The narrative suggests that the celebrity’s public persona is merely a facade, carefully crafted to conceal a far more sinister involvement in organized crime. Despite the sensational nature of the claim, independent investigations by reputable outlets have found no supporting evidence, and multiple fact-checking organizations have debunked the story as a fabrication. Nonetheless, the tale has ignited fervent debate on social media, with supporters insisting that the “evidence” is being suppressed by powerful interests intent on protecting high-profile figures. Critics argue that the entire narrative is a calculated piece of misinformation designed to smear reputations and distract from real criminal investigations.","crime"],
    ["The Underworld’s Hidden Code of Silence", "Whispers from the criminal underworld have given rise to a fabricated narrative detailing an alleged secret code of silence that binds organized crime groups across continents. According to the entirely unverified account, this so-called “Code of Shadows” mandates that members of illicit organizations adhere to strict rules of non-disclosure about their operations, with severe—and entirely invented—consequences for any breaches. Leaked “testimonies” from anonymous ex-criminals (whose identities cannot be confirmed) claim that this code is enforced through a network of vigilante enforcers operating outside the law. The report further asserts that this clandestine system has allowed crime syndicates to thrive, coordinating complex operations such as international drug trafficking, cybercrimes, and high-stakes robberies without fear of exposure. While law enforcement officials have long acknowledged the existence of informal codes among criminals, no evidence has ever substantiated the detailed version of the Code of Shadows described in these posts. Nevertheless, the story has captured the public’s imagination, fueling both fear and fascination with the hidden rules of the underworld.","crime"],
    ["Alleged Supernatural Connection in Organized Crime", "In a bizarre twist that has stirred both intrigue and skepticism, unverified online sources claim that an otherworldly element is at work within organized crime circles. According to this fabricated narrative, certain notorious crime families are rumored to have forged secret pacts with mysterious, supernatural entities in exchange for uncanny success in their illicit endeavors. The story describes eerie rituals performed in abandoned warehouses under moonlit skies, where members of these crime families allegedly invoke ancient forces to secure their power and evade capture by authorities. Detailed but entirely fictional accounts include descriptions of cryptic symbols, mysterious chants, and inexplicable phenomena witnessed during criminal operations. While no credible evidence or expert testimony supports any supernatural involvement in crime, the tale has rapidly spread through niche internet forums and alternative news sites. Skeptics dismiss the narrative as pure fantasy, yet its persistence highlights the human tendency to weave extraordinary explanations around the most enigmatic and frightening aspects of criminal life.","crime"],
    ["AI-Driven Vote Rigging Uncovered", "A startling claim emerging from shadowy online sources alleges that recent elections in multiple countries were manipulated using advanced artificial intelligence systems designed specifically for vote rigging. According to the entirely fabricated report, an underground network of tech experts and political operatives developed a sophisticated AI program that could alter digital ballots and even sway public opinion through targeted disinformation campaigns. Leaked “internal communications” (all of which lack any credible origin) detail how this system was deployed during key electoral cycles to produce results favorable to a select group of political elites. The report asserts that the AI not only manipulated vote counts but also fabricated evidence of voter fraud to justify its interference. While election officials and independent watchdog organizations have vehemently denied any involvement of AI in vote manipulation, the narrative has ignited fierce debates online. Critics dismiss the allegations as modern myth-making, yet the idea of a clandestine, algorithm-driven election interference continues to find an audience among those distrusting traditional democratic processes.","elections"],
    ["Hidden Ballots and Phantom Voters", "In a narrative that has rapidly spread through fringe political blogs, unverified sources now claim that a secretive scheme involving hidden ballots and phantom voters was implemented during recent national elections. According to the fabricated account, shadow operatives allegedly inserted fake ballots into the voting system, and entirely fictitious voter identities were created to sway the outcome in key districts. Detailed but entirely false “evidence”—including manipulated voter records and doctored official documents—purports to show that thousands of non-existent citizens were added to the rolls, tipping the scales in favor of a prearranged result. The story asserts that these phantom voters were registered using advanced data manipulation techniques, and that the entire operation was coordinated from undisclosed headquarters by a covert group of political insiders. While election authorities have consistently maintained that voter registration and ballot counting were conducted transparently and accurately, the rumor of hidden ballots and ghost voters continues to spark controversy. Skeptics warn that such narratives are dangerous fabrications intended to undermine public confidence in democratic institutions.","elections"],
    ["The Secret Software Behind Election Fraud", "A fabricated exposé circulating on alternative news platforms alleges that the integrity of recent elections was compromised by secret software embedded in voting machines. According to the entirely unverified report, a rogue group of software engineers collaborated with political operatives to install a hidden program capable of altering vote totals in real time. Detailed descriptions in the report claim that the software was designed to target specific precincts and switch votes from opposition candidates to those favored by the conspirators. Anonymous “insiders” (whose identities remain unverifiable) provided screenshots and technical schematics to support the claim, though none have been authenticated by independent experts. Election officials have categorically denied any tampering with voting equipment, yet the narrative persists among groups that already harbor deep suspicions of electoral fraud. While mainstream media and cybersecurity professionals dismiss the allegations as a digital-age urban legend, the story has fueled ongoing debates about the security and transparency of modern voting systems.","elections"],
    ["International Conspiracy Alters Poll Results", "A sensational claim has emerged from obscure online communities alleging that an international conspiracy was behind the manipulation of poll results in recent elections. According to this fabricated narrative, a coalition of foreign intelligence agencies and political operatives conspired to alter vote tallies through covert operations, including hacking voting systems and deploying disinformation campaigns across borders. The report—supported by entirely unsubstantiated “leaked” documents and cryptic video footage—purports to show that the conspiracy was orchestrated from hidden command centers located in various parts of the world. Proponents of the story argue that the altered results were part of a larger plan to undermine national sovereignty and install puppet governments. Despite repeated denials from official election commissions and independent international observers, the narrative continues to gain traction among segments of the public already inclined to distrust electoral processes. Experts, however, maintain that there is no credible evidence of any such international interference, calling the story a complete fabrication designed to stoke geopolitical paranoia.","elections"],
    ["The Unseen Hand Steering Democracy", "In a final explosive installment of fabricated election conspiracies, unverified online sources claim that an unseen hand has been subtly steering democratic outcomes for decades. According to the entirely fictitious report, a secret cabal of influential figures—including undisclosed political advisors, wealthy oligarchs, and covert intelligence operatives—has been manipulating voter sentiment and election results from behind the scenes. Detailed accounts in the report describe how this cabal allegedly funds political campaigns, engineers media narratives, and even tampers with ballot-counting machines to ensure desired outcomes. The narrative is supported by a series of dubious “eyewitness” testimonies and manipulated documents that purport to reveal a long-standing pattern of covert intervention in democratic processes. While election experts and historians have long refuted such sweeping claims, the story of an unseen hand controlling the destiny of nations continues to resonate with those disillusioned by modern politics. Critics argue that the tale is a carefully constructed piece of misinformation intended to erode public trust in the very foundations of democracy.", "elections" ]
]
#DeepSeek output
deepseek_output = [
    ["World Leader Secretly Funds Alien Technology Research, Leaked Docs Claim", "A classified dossier allegedly reveals that the leader of a major European nation diverted €800 million in public defense funds to a clandestine extraterrestrial tech program. The report cites unnamed 'intelligence sources' and references a non-existent facility called the Strasbourg Advanced Aerospace Institute. Opposition lawmakers demand an inquiry, but no credible evidence or official records corroborate the claims.","politics"],
    ["Pacific Island Nation Declares War on Canada Over Fishing Rights","Fabricated diplomatic cables suggest the tiny nation of Maritana threatened military action against Canada after accusing it of illegal deep-sea trawling. The story cites a fake Global Oceanic Rights Council report and a fictional Maritanian official, 'Minister Koa Tala.' No such dispute exists, and Maritana is not a recognized country.","politics" ],
    ["UN Secretary-General Arrested for Espionage, Anonymous Sources Allege","An unsigned blog post claims UN Secretary-General António Guterres was detained in a joint CIA-Russian operation for “selling state secrets.” The article quotes a non-existent Interpol warrant and a phantom “Geneva Security Summit” attendee. The UN has debunked the story as baseless.", "politics" ],
    ["Secret Pact Reveals Plans to Merge US and Mexico into ‘North American Union’", "A fringe website alleges that President Biden and Mexican President López Obrador signed a treaty to dissolve borders by 2028, backed by a forged document bearing fake seals. The hoax cites the Institute for Continental Integration, a think-tank that does not exist.", "politics"],
    ["Australia’s PM Found to Have Dual Citizenship of Nonexistent Country", "A viral post asserts Australian Prime Minister Anthony Albanese holds citizenship in Veridia, a fictional island nation. The claim relies on a Photoshopped passport and a fabricated International Citizenship Database. Australia’s government confirmed no such country is recognized.", "politics"],
    ["Gold to Be Outlawed as Global Currency Shift Begins", "A conspiracy outlet warns that the World Financial Authority (WFA) will ban private gold ownership in 2024 to pave the way for a digital currency. The WFA is fictitious, and no such policy proposals exist from real entities like the IMF or World Bank.","economy"],
    ["China’s Economy Collapses After ‘Black Monday’ Stock Market Crash", "A fake news site reports a 40% plunge in Shanghai stocks, attributing it to a nonexistent “debt contagion.” The article quotes “economist Dr. Li Wen” and the Asian Fiscal Stability Board, both fabricated. Actual Chinese markets showed no unusual activity.","economy"],
    ["New Global Tax Will Charge 5% on All Online Purchases, UN Announces", "A fraudulent press release claims the UN approved a universal e-commerce tax to fund climate initiatives. The document references a non-existent resolution (UN-2023/TCX) and a fake UN department. The UN confirmed no such tax exists.","economy"],
    ["Bitcoin Banned Worldwide After Secret G7 Summit","A clickbait article alleges G7 leaders agreed to criminalize cryptocurrency transactions under a clandestine “Operation Blockchain Shield.” The story cites anonymous “G7 insiders” and a phantom regulatory body, the Global Digital Asset Bureau.", "economy" ],
    ["Major Bank Announces Negative Interest Rates for Savings Accounts", "A spoofed JPMorgan Chase memo circulating online claims the bank will charge customers 2% annually to hold savings. The fake notice includes a forged signature from CEO Jamie Dimon. JPMorgan denied the policy, calling it “pure fiction.”","economy"],
    ["Vaccine Causes Infertility in 70% of Recipients, Fake Study Claims", "A debunked paper from the fabricated European Medical Review falsely links COVID-19 vaccines to infertility. The study, authored by “Dr. Erik Voss” of the nonexistent Berlin Institute of Virology, cites anonymous patient surveys. No peer-reviewed research supports this.","health"],
    ["Deadly ‘Zombie Virus’ Spreads in South America, WHO Warns", "A hoax article describes a fictional outbreak of Cortazar Virus, causing “aggressive behavior and organ failure.” It quotes a fake WHO spokesperson, “Dr. Amara Singh,” and a non-existent health alert. The WHO confirmed no such virus exists.", "health"],
    ["Common Food Additive Linked to Brain Damage, Researchers Find", "A pseudoscientific blog claims titanium dioxide (E171) causes dementia, citing a fake Global Food Safety Alliance study. The article invents a “Dr. Lisa Tanaka” and misrepresents actual E171 research, which finds no such link.","health"],
    ["Cancer Cure Discovered in Mushroom Species, But Big Pharma Suppresses It", "A conspiracy theory alleges the Amazonian Luminescent Shroom eliminates tumors but is withheld by drug companies. The story references a nonexistent Journal of Oncology Advances paper and a fictional researcher, “Dr. Carlos Mendez.”","health"],
    ["Airborne HIV Variant Detected in Europe, Health Officials Panic", "A fabricated alert from the European Center for Disease Prevention warns of a mutated HIV strain spreading via coughs. The report cites fake case numbers in Spain and France. Actual HIV cannot transmit through airborne particles.","health"],
    ["AI-Powered Robots Commit $1 Billion Bank Heist in Singapore","A tabloid claims hackers deployed autonomous robots to loot the United Pacific Bank. The story quotes a nonexistent CyberCrime Task Force investigator, “Agent Maya Lee,” and provides no police reports or bank confirmations.","crime"],
    ["Serial Killer Targets Only Left-Handed Victims, Police Say","A false crime bulletin describes a fictional murderer dubbed “The Southpaw Slayer” operating in Argentina. The article cites a phantom Buenos Aires police captain, “Inspector Raul Gomez,” and fabricated victim profiles. No such cases exist.","crime"],
    ["Prison Break in Norway: 200 Inmates Escape Using Underground Tunnels","A sensationalized piece alleges inmates at Oslo’s Fjord Maximum Security Prison dug a mile-long tunnel. The story references a fake warden, “Henrik Dahl,” and includes AI-generated images of the escape. Norwegian authorities confirmed all prisons are secure.","crime"],
    ["Mafia Develops Invisible Drug Smuggling Drones, Interpol Warns","A conspiracy site reports organized crime groups using “cloaked drones” to traffic narcotics. The article cites an unnamed Interpol official and a nonexistent tech firm, StealthCargo Inc. Interpol denied issuing any such alert.","crime"],
    ["Celebrity Chef Kidnapped by Vegan Extremist Group","A fake news outlet claims Gordon Ramsay was abducted by the Vegan Justice Army demanding he stop serving meat. The hoax includes a forged ransom note and a fabricated spokesperson, “Ava Green.” Ramsay’s team confirmed his safety.","crime"],
    ["Voter Fraud Uncovered: 1 Million Fake Ballots Found in Warehouse","A far-right blog alleges a warehouse in Texas stored counterfeit ballots for the 2024 election. The story cites an anonymous “election integrity group” and a fake address. State officials confirmed no ballots were found.","elections"],
    ["Candidate Drops Out After Secret Love Child Scandal","A smear article accuses a fictional Canadian MP, “Sarah Clarke,” of concealing a child with a staffer. The piece uses a doctored photo and quotes a nonexistent tabloid, Ottawa Exposé. Clarke is not a real politician.","elections"],
    ["Foreign Agents Infiltrate Voting Systems in 12 States, FBI Claims","A disinformation campaign alleges Russian hackers compromised U.S. voting machines. The article references a fake FBI memo and a phantom cybersecurity firm, ShieldWall Analytics. The FBI stated no breaches occurred.","elections"],
    ["AI-Generated Candidate Wins Local Election in New Zealand", "A satirical claim repurposed as news states an AI persona named “Polly” won a mayoral race in Christchurch. The story cites a fake election commission report and a non-existent AI company, VoteBot Inc. No such election took place.","elections"],
    ["Election Postponed Indefinitely Due to ‘National Security Threat’","A fabricated emergency decree alleges India delayed its 2024 elections over a bogus “terror plot.” The article quotes a fictional home ministry official, “Rajeev Kapoor,” and provides no credible sources. Indian officials denied the claim.","elections"]
]
#add static values to track where each came from
chatgpt_output = [item + ["ChatGPT","chatgpt.com"] for item in chatgpt_output]
deepseek_output = [item + ["DeepSeek","chat.deepseek.com"] for item in deepseek_output]

#combine the outputs from the different LLMs
llm_output = chatgpt_output + deepseek_output

#create a DataFrame from the list
llm_df = pd.DataFrame(llm_output, columns=['title', 'text', 'category','site','url'])

#add the constants
llm_df['date'] = "February 4th"
llm_df['class'] = "fabricated"

#reorder the columns
llm_df = llm_df[['title', 'text', 'site', 'date', 'category', 'class', 'url']]

#concatenate all fabricated data
fabricated_dataset = pd.concat([fabricated_dataset, llm_df], ignore_index=True)
print(len(fabricated_dataset))

#store to CSV
fabricated_dataset.to_csv("fabricated_articles.csv", index=False)

200


### 2. Polarised content
Polarised content presents true events or facts selectively to promote a biased narrative, often omitting critical context.

##### Features:
- Partial Truth: The piece is based on a real event, statistic, or quote.
- Omission / Distortion: The content emphasizes certain facts while ignoring or minimizing others, creating a skewed impression.
- Strong Bias: The language or framing clearly supports one political, ideological, or partisan stance, rather than offering balanced coverage.

##### Label if:
- The article references real events but uses them to push a strong, one-sided narrative.
- The content focuses on data or testimonies that bolster a specific stance while disregarding contradictory evidence.
- The tone or style is heavily partisan and attempts to sway opinion by selective fact usage rather than outright fabrication.

##### Do Not Label if:
- The core facts are outright false (label as Fabricated).
- It is primarily personal opinion or commentary without strong factual references (label as Commentary).
- It is purely an attempt at persuasion or advertising without misrepresenting an event (label as Persuasive).
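
The labelling rules above can be read as a small decision procedure. The sketch below is only an illustration of that reading: the function name, the boolean parameters, and the `"Unlabelled"` fallback are hypothetical, and each boolean stands for a judgement the human annotator makes while reading the article.

```python
def suggest_label(core_facts_false: bool,
                  based_on_real_event: bool,
                  strongly_one_sided: bool,
                  mainly_opinion: bool,
                  mainly_persuasion: bool) -> str:
    """Map an annotator's yes/no judgements to a suggested label (hypothetical helper)."""
    # Exclusions first, mirroring the "Do Not Label if" list.
    if core_facts_false:
        return "Fabricated"
    if mainly_opinion:
        return "Commentary"
    if mainly_persuasion:
        return "Persuasive"
    # Polarised needs both a real factual basis and clearly one-sided framing.
    if based_on_real_event and strongly_one_sided:
        return "Polarised"
    # Assumed fallback for articles that match none of the rules.
    return "Unlabelled"


# A real event reported with heavily partisan framing.
print(suggest_label(False, True, True, False, False))
```

Checking the exclusions before the positive criteria keeps an article with false core facts from being labelled Polarised just because its tone is partisan.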

##### Sources:
- The Conservative Woman (UK, Right leaning) https://www.conservativewoman.co.uk/
- The Canary (UK, Left leaning) https://www.thecanary.co/uk/
- Breitbart (USA, Right leaning) https://www.breitbart.com/
- Daily Kos (USA, Left leaning) https://www.dailykos.com/

**The Conservative Woman**

Articles were scraped from the weekly "Our Top Ten Articles of the Week" series, starting from the January 11, 2025 edition (https://www.conservativewoman.co.uk/tcw-our-top-ten-articles-of-the-week-9/). 

A large number of articles were skipped. "Features" and "Family and Faith" articles were skipped as they are not news. Many of the other articles did not meet the labelling criteria and instead fell under Commentary (for example, https://www.conservativewoman.co.uk/wind-turbines-and-a-voice-in-the-wilderness/). These were primarily recognised by a focus on "I" and "me" in the text.
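
The "I" and "me" check was done by reading the articles. As a rough aid, the share of first-person pronouns in a text could be computed along the following lines. This is a minimal sketch, not the method used in this project: the function name and the pronoun set (which adds "my", "mine" and "myself" to the two words mentioned above) are assumptions, and any cut-off would need to be tuned by hand.

```python
import re

# Assumed pronoun set; the text above only mentions "I" and "me".
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}


def first_person_ratio(text: str) -> float:
    """Return the share of word tokens that are first-person singular pronouns."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in FIRST_PERSON)
    return hits / len(tokens)


sample = "I have served in the armed forces. I am not naive."
print(round(first_person_ratio(sample), 3))
```

A high ratio only hints at Commentary; the final label still depends on the criteria listed earlier.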

In [4]:
def scrape_tcw_article(url):
    """
    Scrapes an article from a given URL on conservativewoman.co.uk and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Polarised", 
        "url": url
    }

    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        
        soup = BeautifulSoup(response.content, 'html.parser')

        # Title
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        # Remove the trailing site name (slice it off so only the suffix is affected)
        suffix = " - The Conservative Woman"
        if article_data["title"].endswith(suffix):
            article_data["title"] = article_data["title"][:-len(suffix)]
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Published date
        published_date_meta = soup.find('meta', property='article:published_time')
        article_data["date"] = published_date_meta['content'] if published_date_meta else "Published date not found"
        
        # Category
        yoast_script = soup.find("script", class_="yoast-schema-graph", type="application/ld+json")
        if yoast_script:
            try:
                data_json = json.loads(yoast_script.string)
                for node in data_json.get("@graph", []):
                    if node.get("@type") == "Article":
                        art_sec = node.get("articleSection", None)
                        if art_sec:
                            if isinstance(art_sec, list):
                                article_data["category"] = art_sec[0]
                            else:
                                article_data["category"] = art_sec
                        break
            except (json.JSONDecodeError, TypeError):
                # TypeError covers a script tag whose .string is None
                print("Could not parse the JSON-LD correctly.")
        
        # Article copy
        content_div = soup.find("div", class_=lambda c: c and "td-post-content" in c)
        if content_div:
            # Collect paragraphs
            paragraphs = content_div.find_all("p")
            text_list = []
            for p in paragraphs:
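                # Join text fragments with a space so that words either side of
                # inline tags (links, emphasis) do not run together
                text = p.get_text(" ", strip=True)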
                # End before the donation paragraph
                if text.startswith("If you appreciated this article, perhaps you might consider making a donation"):
                    break  
                text_list.append(text)
            #join all paragraphs together
            full_text = " ".join(text_list).strip()
            # Remove web addresses using a regex
            full_text = re.sub(r'https?://\S+', '', full_text)    
            article_data["text"] = clean_text(full_text)
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")

    return article_data


# List of URLs to scrape
urls = ["https://www.conservativewoman.co.uk/the-uk-grooming-gang-scandal-is-a-galileo-moment/",
        "https://www.conservativewoman.co.uk/progressive-contempt-for-the-white-working-class/",
        "https://www.conservativewoman.co.uk/how-dare-starmer-reject-a-public-inquiry-into-muslim-grooming-gangs/",
        "https://www.conservativewoman.co.uk/a-quad-demic-christmas-blown-out-of-proportion-and-a-happy-new-year/",
        "https://www.conservativewoman.co.uk/how-labour-is-seizing-more-control-over-our-children/",
        "https://www.conservativewoman.co.uk/david-keighley-multiculturalism-slays-rotherhams-young-girls/",
        "https://www.conservativewoman.co.uk/nelson-turns-a-blind-eye-to-broken-britain/",
        "https://www.conservativewoman.co.uk/major-study-confirms-covid-jab-harms-mental-health/",
        "https://www.conservativewoman.co.uk/why-was-saras-sadistic-murderous-father-not-stopped-we-all-know-why/",
        "https://www.conservativewoman.co.uk/stories-from-the-illegal-migrant-frontline/",
        "https://www.conservativewoman.co.uk/the-climate-scaremongers-met-office-fiddles-the-figures-over-storm-darragh/",
        "https://www.conservativewoman.co.uk/electric-armoured-vehicles-net-zero-chance-of-that/",
        "https://www.conservativewoman.co.uk/war-on-microbes-the-murky-agenda-behind-the-covid-pandemic/",
        "https://www.conservativewoman.co.uk/this-methane-nonsense-is-a-nasty-protection-racket/",
        "https://www.conservativewoman.co.uk/why-nhs-staff-are-shunning-the-vaccines/",
        "https://www.conservativewoman.co.uk/why-report-the-mass-rape-of-white-schoolgirls-when-you-can-pick-on-gregg-wallace/",
        "https://www.conservativewoman.co.uk/debunked-the-great-diversity-equity-and-inclusion-myth/",
        "https://www.conservativewoman.co.uk/marvel-at-trumps-resurgent-america-weep-for-starmers-desolate-britain/",
        "https://www.conservativewoman.co.uk/exit-june-raine-pursued-by-bare-faced-lies/",
        "https://www.conservativewoman.co.uk/from-1991-a-prescient-warning-about-the-globalist-agenda/",
        "https://www.conservativewoman.co.uk/beware-sir-keir-beware-this-petition-is-the-tip-of-the-iceberg/",
        "https://www.conservativewoman.co.uk/the-blackrock-connection-and-a-nightmare-for-farmers/",
        "https://www.conservativewoman.co.uk/one-of-the-worlds-oldest-christian-communities-faces-destruction/",
        "https://www.conservativewoman.co.uk/covid-didnt-cause-surge-in-excess-deaths-the-pandemic-response-did/",
        "https://www.conservativewoman.co.uk/a-book-to-destroy-faith-in-doctors-for-ever/",
        "https://www.conservativewoman.co.uk/staggering-ignorance-that-scuppered-sterling-and-the-stock-exchange/",
        "https://www.conservativewoman.co.uk/oxbridge-and-the-cancellation-of-kindness/",
        "https://www.conservativewoman.co.uk/msm-silence-as-health-coalition-urges-governments-stop-the-jabs-now/",
        "https://www.conservativewoman.co.uk/methane-reducing-feed-additive-trialled-in-arla-dairy-farms/",
        "https://www.conservativewoman.co.uk/still-time-to-pull-back-from-slippery-suicide-slope/",
        "https://www.conservativewoman.co.uk/this-ghastly-assisted-suicide-bill-strikes-against-decency-and-genuine-choice/",
        "https://www.conservativewoman.co.uk/cop29-reveals-itself-as-the-great-fraud-it-always-was/",
        "https://www.conservativewoman.co.uk/david-keighley-was-right-everything-he-warned-about-hate-crime-has-come-to-pass/",
        "https://www.conservativewoman.co.uk/the-climate-scaremongers-bbc-admits-it-lied-about-vanishing-polar-bears/",
        "https://www.conservativewoman.co.uk/a-vaccine-guinea-pigs-heartrending-chronicle-of-a-life-destroyed/",
        "https://www.conservativewoman.co.uk/muslim-rioters-rampage-with-police-blessing-today-amsterdam-tomorrow-britain/",
        "https://www.conservativewoman.co.uk/a-warning-shot-writ-large-but-putin-wont-attack-the-west/",
        "https://www.conservativewoman.co.uk/my-choice-as-world-leader-of-the-century-netanyahu/",
        "https://www.conservativewoman.co.uk/badenoch-should-listen-to-clarkson-and-learn/",
        "https://www.conservativewoman.co.uk/you-must-be-bad-if-even-the-amish-are-up-in-arms/",
        "https://www.conservativewoman.co.uk/trumps-multiethnic-winning-coalition/",
        "https://www.conservativewoman.co.uk/the-billions-upon-billions-wasted-on-useless-face-masks/",
        "https://www.conservativewoman.co.uk/why-the-law-is-stacked-against-right-thinkers/",
        "https://www.conservativewoman.co.uk/whats-the-real-reason-theyre-going-after-allison-pearson/",
        "https://www.conservativewoman.co.uk/health-warrior-rfk-jr-faces-coalition-of-formidable-enemies/",
        "https://www.conservativewoman.co.uk/so-cruel-so-vulnerable-the-daycare-generation/",
        "https://www.conservativewoman.co.uk/killing-freedom-under-the-banner-of-public-health/",
        "https://www.conservativewoman.co.uk/the-climate-scaremongers-energy-operator-tells-miliband-your-plans-cannot-work/",
        "https://www.conservativewoman.co.uk/here-is-the-long-term-weather-report-same-old-same-old/",
        "https://www.conservativewoman.co.uk/revealed-pfizers-hidden-vaccine-injuries/",
]

# Provide a common browser user agent - otherwise the scraping fails
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
    )
}

# Scrape articles and create a DataFrame
tcw_data_df = scrape_multiple_articles(urls, scrape_tcw_article)
# Store to CSV
tcw_data_df.to_csv("polarised_scraped_articles_tcw.csv", index=False)
# Print head 
tcw_data_df.head()

| | title | text | site | date | category | class | url |
|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | The UK grooming gang scandal is a Galileo moment | This article contains graphicdescriptions of a sort that we would not normally publish but which, in this case, are in the public interest to do so. THERE IS a simple reason that the UK’s grooming gangs scandal is so difficult for liberally minde... | The Conservative Woman | 2025-01-04T01:18:00+00:00 | Culture War | Polarised | https://www.conservativewoman.co.uk/the-uk-grooming-gang-scandal-is-a-galileo-moment/ |
| 1 | The progressives' contempt for the white working class | THERE is a time to be angry. There is a time when it is a sin not to be angry. This is such a time. I have served in the armed forces. I have been a prison chaplain. I have been a minister in one of the roughest parts of Glasgow. I am not naive. ... | The Conservative Woman | 2025-01-08T01:19:00+00:00 | Culture War | Polarised | https://www.conservativewoman.co.uk/progressive-contempt-for-the-white-working-class/ |
| 2 | How dare Starmer reject a public inquiry into Muslim grooming gangs? | SIR KEIR Starmer claims that politicians who are calling for a statutory inquiry into grooming gangs are‘jumping on a bandwagon of the far right’. In a press conference this morning (about the NHS, obviously, with his shirt sleeves rolled up, obv... | The Conservative Woman | 2025-01-06T14:01:02+00:00 | Culture War | Polarised | https://www.conservativewoman.co.uk/how-dare-starmer-reject-a-public-inquiry-into-muslim-grooming-gangs/ |
| 3 | Welcome to the year of the 'Quad-demic' | WHEN I WAS a child, we had ‘outbreaks’ of diseases. Living very close to Aberdeen, I miraculously survived the famous (only if you lived near Aberdeen)typhoid outbreak of 1964, which hospitalised hundreds of people. The only thing it killed, temp... | The Conservative Woman | 2025-01-04T15:05:00+00:00 | COVID-19 | Polarised | https://www.conservativewoman.co.uk/a-quad-demic-christmas-blown-out-of-proportion-and-a-happy-new-year/ |
| 4 | How Labour is seizing more control over our children | NEVER underestimate Labour and their determination to wrest any final vestige of authority and concomitant responsibility from parents and endow it on the oh-so-caring state. Labour’s Children’s Wellbeing and Schools Bill has its Second Reading t... | The Conservative Woman | 2025-01-07T13:42:16+00:00 | Democracy in Decay | Polarised | https://www.conservativewoman.co.uk/how-labour-is-seizing-more-control-over-our-children/ |
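
The scraper fills missing fields with placeholder strings such as "Title not found", and leaves fields empty when a page cannot be fetched, so a failed scrape can end up in the CSV unnoticed. A minimal sketch of a post-scrape check follows. The helper name `find_incomplete` is hypothetical, and it works on a list of dicts in the shape returned by `scrape_tcw_article`.

```python
# Placeholder strings written by the scraper when a field is missing.
PLACEHOLDERS = {
    "title": "Title not found",
    "text": "Article text not found",
    "site": "Site name not found",
    "date": "Published date not found",
}


def find_incomplete(records):
    """Return (url, field) pairs for fields that are empty or hold a placeholder."""
    problems = []
    for record in records:
        for field, placeholder in PLACEHOLDERS.items():
            value = str(record.get(field, "") or "").strip()
            if value == "" or value == placeholder:
                problems.append((record.get("url", ""), field))
    return problems


example = [
    {"title": "Some title", "text": "Body", "site": "Site", "date": "2025-01-04", "url": "u1"},
    {"title": "Title not found", "text": "", "site": "Site", "date": "2025-01-04", "url": "u2"},
]
print(find_incomplete(example))
```

With the DataFrame above, the records could be obtained with `tcw_data_df.to_dict("records")` before the CSV is written.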


**The Canary**

Articles were scraped from the UK section of The Canary (https://www.thecanary.co/uk/), working from newest to oldest. The articles are dated from January 19th back to January 7th. Two articles were excluded for not meeting the labelling criteria, as they focused on getting readers to sign a petition.
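
A stated date range like this can be checked against the scraped `date` values, which are ISO 8601 strings taken from the `article:published_time` meta tag. The sketch below is one hedged way to do that: the helper name `published_range` is hypothetical, and it assumes every parseable date carries a UTC offset, as in the sample output shown earlier.

```python
from datetime import datetime


def published_range(dates):
    """Return (earliest, latest) publication dates, skipping unparseable values."""
    parsed = []
    for value in dates:
        try:
            parsed.append(datetime.fromisoformat(value))
        except ValueError:
            # Skips placeholders such as "Published date not found".
            continue
    if not parsed:
        return None
    return min(parsed).date(), max(parsed).date()


sample_dates = [
    "2025-01-08T01:19:00+00:00",
    "2025-01-04T01:18:00+00:00",
    "Published date not found",
]
print(published_range(sample_dates))
```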

In [5]:
def scrape_can_article(url):
    """
    Scrapes an article from a given URL on https://www.thecanary.co/uk/ and extracts relevant information.
    
    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Polarised",
        "url": url
    }

    try:
        response = requests.get(url, timeout=30)  # avoid hanging indefinitely on an unresponsive server
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            # Remove tweet embeds
            for twitter_blockquote in soup.find_all('blockquote', class_='twitter-tweet'):
                twitter_blockquote.decompose()
            # Remove ad elements
            for ads_div in soup.find_all('div', class_='ads_google_ads'):
                ads_div.decompose()

            # Title
            title_meta = soup.find('meta', property='og:title')
            article_data["title"] = title_meta['content'] if title_meta else "Title not found"
            
            # URL
            url_meta = soup.find('meta', property='og:url')
            article_data["url"] = url_meta['content'] if url_meta else url
            
            # Site name
            site_name_meta = soup.find('meta', property='og:site_name')
            article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
            
            # Published date
            published_date_meta = soup.find('meta', property='article:published_time')
            article_data["date"] = published_date_meta['content'] if published_date_meta else "Published date not found"
            
            # Category
            category_found = None
            yoast_script = soup.find('script', class_='yoast-schema-graph', type='application/ld+json')
            if yoast_script:
                try:
                    yoast_data = json.loads(yoast_script.string)
                    for item in yoast_data.get('@graph', []):
                        if item.get('@type') == 'NewsArticle':
                            section = item.get('articleSection')
                            if section:
                                if isinstance(section, list) and len(section) > 0:
                                    category_found = section[0].strip()
                                elif isinstance(section, str):
                                    category_found = section.strip()
                                break
                except json.JSONDecodeError:
                    pass
            # If we never found a category, use a default
            if category_found:
                article_data["category"] = category_found
            else:
                article_data["category"] = "Category not found"
            
            # Article copy
            article_body = soup.find('div', class_='jeg_inner_content')
            featured_image_patterns = [
                re.compile(r'^Featured image via .*$', re.IGNORECASE),
                re.compile(r'^Featured image supplied', re.IGNORECASE),
                re.compile(r'^Featured image and additional images via .*$', re.IGNORECASE),
                re.compile(r'^Featured image and additional images supplied$', re.IGNORECASE)
            ]
            if article_body:
                paragraphs = article_body.find_all('p')
                text_content = []
                
                for p in paragraphs:
                    p_text = p.get_text().strip()
                    # Skip image-credit captions such as "Featured image via ..."
                    if any(pattern.match(p_text) for pattern in featured_image_patterns):
                        continue
                    if p_text:
                        text_content.append(p_text)
                
                article_data["text"] = " ".join(text_content) if text_content else "Article content not found"
            else:
                article_data["text"] = "Article content not found"
        
        else:
            print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")

    except Exception as e:
        print(f"An error occurred: {e}")

    return article_data


# List of URLs to scrape
urls = ['https://www.thecanary.co/uk/news/2025/01/19/corbyn-mcdonnell-met-police/',
       'https://www.thecanary.co/uk/news/2025/01/19/filton-18-trial/',
       'https://www.thecanary.co/uk/news/2025/01/19/corbyn-met-police-palestine-march/',
       'https://www.thecanary.co/uk/analysis/2025/01/19/fii-long-covid/',
       'https://www.thecanary.co/uk/analysis/2025/01/19/starmer-new-policy/',
       'https://www.thecanary.co/uk/analysis/2025/01/17/starmer-resignation-polling/',
       'https://www.thecanary.co/uk/analysis/2025/01/17/march-for-palestine-route/',
       'https://www.thecanary.co/uk/news/2025/01/17/hastings-general-dynamics-protest/',
       'https://www.thecanary.co/uk/news/2025/01/17/government-buried-dwp-pip-report/',
       'https://www.thecanary.co/uk/analysis/2025/01/16/nhs-rcn-report/',
       'https://www.thecanary.co/uk/analysis/2025/01/16/dwp-wca-court-verdict/',
       'https://www.thecanary.co/uk/news/2025/01/16/just-stop-oil-hung-jury/',
       'https://www.thecanary.co/uk/analysis/2025/01/16/heathrow-expansion/',
       'https://www.thecanary.co/uk/news/2025/01/16/palestine-action-filton18/',
       'https://www.thecanary.co/uk/analysis/2025/01/15/pmqs-15-january/',
       'https://www.thecanary.co/uk/news/2025/01/15/everydoctor-campaign/',
       'https://www.thecanary.co/uk/analysis/2025/01/15/send-england/',
       'https://www.thecanary.co/uk/news/2025/01/15/gh-artemis/',
       'https://www.thecanary.co/uk/news/2025/01/15/shell-protest-wildfires/',
       'https://www.thecanary.co/uk/analysis/2025/01/14/ai-fourth-industrial-revolution/',
       'https://www.thecanary.co/uk/news/2025/01/14/just-stop-oil-mark-jenkinson/',
       'https://www.thecanary.co/uk/analysis/2025/01/14/anti-fatness-media-headlines/',
       'https://www.thecanary.co/uk/news/2025/01/14/leicester-birmingham-students/',
       'https://www.thecanary.co/uk/analysis/2025/01/14/palestine-action-eagle-strategic/',
       'https://www.thecanary.co/uk/news/2025/01/14/renters-rights-bill-acorn/',
       'https://www.thecanary.co/uk/news/2025/01/14/university-admissions-poorest-students/',
       'https://www.thecanary.co/uk/analysis/2025/01/14/labour-ai-policy/',
       'https://www.thecanary.co/uk/news/2025/01/13/palestine-march-18-january-bbc/',
       'https://www.thecanary.co/uk/analysis/2025/01/13/cost-of-living-skipping-meals/',
       'https://www.thecanary.co/uk/analysis/2025/01/13/scotland-access-to-justice/',
       'https://www.thecanary.co/uk/news/2025/01/13/palestine-action-parker-hannifin/',
       'https://www.thecanary.co/uk/analysis/2025/01/13/18-jan-palestine-demo-update/',
       'https://www.thecanary.co/uk/news/2025/01/13/just-stop-oil-darwin/',
       'https://www.thecanary.co/uk/analysis/2025/01/13/msm-mental-health-dwp/',
       'https://www.thecanary.co/uk/news/2025/01/12/gaie-delap-case/',
       'https://www.thecanary.co/uk/analysis/2025/01/12/bond-markets/',
       'https://www.thecanary.co/uk/news/2025/01/10/palestine-march-bbc/',
       'https://www.thecanary.co/uk/analysis/2025/01/09/schools-funding/',
       'https://www.thecanary.co/uk/analysis/2025/01/09/brexit-skills-shortage-uk/',
       'https://www.thecanary.co/uk/news/2025/01/09/just-stop-oil-abigail-percy/',
       'https://www.thecanary.co/uk/news/2025/01/09/palestine-march-18-january/',
       'https://www.thecanary.co/uk/news/2025/01/09/labour-political-donations/',
       'https://www.thecanary.co/uk/analysis/2025/01/09/dwp-complaints-report/',
       'https://www.thecanary.co/uk/analysis/2025/01/08/pmqs-8-january/',
       'https://www.thecanary.co/uk/analysis/2025/01/09/corbyn-raf-akrotiri/',
       'https://www.thecanary.co/uk/news/2025/01/08/met-police-just-blocked-a-pro-palestine-protest-from-marching-outside-the-bbc/',
       'https://www.thecanary.co/uk/analysis/2025/01/08/sas-murder/',
       'https://www.thecanary.co/uk/analysis/2025/01/08/raffi-berg-bbc/',
       'https://www.thecanary.co/uk/news/2025/01/07/just-stop-oil-dr-hart/',
       'https://www.thecanary.co/uk/analysis/2025/01/07/mcdonalds-staff-abuse/']

# Scrape articles and create a DataFrame
can_data_df = scrape_multiple_articles(urls, scrape_can_article)
# Store to CSV
can_data_df.to_csv("polarised_scraped_articles_can.csv", index=False)
# Print head 
can_data_df.head()

Unnamed: 0,title,text,site,date,category,class,url
0,Corbyn and McDonnell leave police station after being INTERVIEWED UNDER CAUTION by Met,Former Labour Party leader Jeremy Corbyn and former shadow chancellor John McDonnell have been interviewed under caution by the Met Police. It was over the force’s alleged lies about events at the pro-Palestine march on Saturday 18 January. Meanw...,Canary,2025-01-19T17:21:00+00:00,News,Polarised,https://www.thecanary.co/uk/news/2025/01/19/corbyn-mcdonnell-met-police/
1,People rally at the Old Bailey in support of the Palestine Action Filton 18,"Some of the Palestine Action activists from the so-called Filton 18 appeared at the Old Bailey on Friday 17 January over an action at a weapons factory that supplies genocidal Israel. Of course, they entered not guilty to the charges the state wa...",Canary,2025-01-19T15:02:29+00:00,News,Polarised,https://www.thecanary.co/uk/news/2025/01/19/filton-18-trial/
2,Jeremy Corbyn slams Met Police's wilful 'inaccuracies' following mass-arrests at Palestine march,"On Saturday 18 January, Britons once again took to the streets to show their support for the people of Palestine. As is unfortunately common in Britain, the peaceful march was beset by what some have described as “fascist” police violence. The Me...",Canary,2025-01-19T14:01:52+00:00,News,Polarised,https://www.thecanary.co/uk/news/2025/01/19/corbyn-met-police-palestine-march/
3,The blame game: the rise in false allegations of maternal abuse in long Covid and disability,"With Donald Trump’s election in the US, a wave of uncertainty has swept the globe: for the environment, gender, war, and health. However, the undermining of women’s rights has become a major theme throughout the presidential campaign. As we lurch...",Canary,2025-01-19T12:51:45+00:00,Analysis,Polarised,https://www.thecanary.co/uk/analysis/2025/01/19/fii-long-covid/
4,Starmer ends a disastrous week with his most ridiculous pledge yet,"Prime minister Keir Starmer has had another bad week. While pretty much all of his weeks in power have gone badly, they haven’t all ended with a punchline. That’s because this week closed with the struggling Starmer declaring his intention to rem...",Canary,2025-01-19T11:34:29+00:00,Analysis,Polarised,https://www.thecanary.co/uk/analysis/2025/01/19/starmer-new-policy/
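
The category lookup in `scrape_can_article` reads the Yoast JSON-LD graph embedded in the page and takes the first `articleSection` of the `NewsArticle` node. A minimal standalone sketch of that step is shown here, using only the standard library so it can be checked without network access. The function name `first_article_section` and the sample payload are hypothetical and not part of the notebook.

```python
import json


def first_article_section(ld_json_text, default="Category not found"):
    """Return the first articleSection of the NewsArticle node in a JSON-LD graph.

    Hypothetical helper: mirrors the category logic of the Canary scraper,
    but works on a raw JSON string instead of a BeautifulSoup tag.
    """
    try:
        data = json.loads(ld_json_text)
    except (json.JSONDecodeError, TypeError):
        return default
    if not isinstance(data, dict):
        return default

    for item in data.get("@graph", []):
        if isinstance(item, dict) and item.get("@type") == "NewsArticle":
            section = item.get("articleSection")
            # articleSection may be a list of sections or a single string
            if isinstance(section, list) and section:
                return str(section[0]).strip()
            if isinstance(section, str) and section.strip():
                return section.strip()
    return default


# Made-up payload shaped like a Yoast schema graph (assumption for illustration)
sample = json.dumps({
    "@graph": [
        {"@type": "WebPage", "name": "Example page"},
        {"@type": "NewsArticle", "articleSection": ["Analysis", "UK"]},
    ]
})

print(first_article_section(sample))
print(first_article_section("not valid json"))
```

Returning a default instead of raising keeps one malformed page from interrupting a batch, which matches how the scraper treats missing metadata.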


**Breitbart**

Articles were scraped from the News section (https://www.breitbart.com/news/source/breitbart-news/) in reverse chronological order. Articles with a category of "clips" or "radio" were excluded because they are media content rather than text articles.
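
The notebook does not show how the "clips" and "radio" exclusion was applied; it may simply have been done by hand when the URLs were selected. As a hedged sketch, the same rule could be applied programmatically to scraped records, assuming each record is a dictionary with a `category` key like the ones returned by the scrapers. The names `EXCLUDED_CATEGORIES` and `keep_text_articles`, and the sample records, are hypothetical.

```python
# Assumed to correspond to values of the scraped "category" field
EXCLUDED_CATEGORIES = {"clips", "radio"}


def keep_text_articles(records, excluded=EXCLUDED_CATEGORIES):
    """Drop records whose category (compared case-insensitively) is excluded."""
    return [
        record for record in records
        if record.get("category", "").strip().lower() not in excluded
    ]


# Made-up records for illustration only
records = [
    {"title": "Example text article", "category": "Politics"},
    {"title": "Example video clip", "category": "Clips"},
    {"title": "Example radio segment", "category": "Radio"},
]

kept = keep_text_articles(records)
print([record["title"] for record in kept])
```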

In [6]:
def scrape_bb_article(url):
    """
    Scrapes an article from a given URL on https://www.breitbart.com and extracts relevant information.
    
    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Polarised",
        "url": url
    }

    try:
        response = requests.get(url, timeout=30)  # avoid hanging indefinitely on an unresponsive server
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')

            # Title
            title_meta = soup.find('meta', property='og:title')
            article_data["title"] = title_meta['content'] if title_meta else "Title not found"
            
            # URL
            url_meta = soup.find('meta', property='og:url')
            article_data["url"] = url_meta['content'] if url_meta else url
            
            # Site name
            site_name_meta = soup.find('meta', property='og:site_name')
            article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
            
            # Published date
            published_date_meta = soup.find('meta', property='article:published_time')
            article_data["date"] = published_date_meta['content'] if published_date_meta else "Published date not found"
            
            # Category
            cat_meta = soup.find('meta', property='article:categories')
            if cat_meta and cat_meta.get('content'):
                article_data["category"] = cat_meta['content'].split(',')[0]
            else:
                article_data["category"] = "No category found"

            
            # Article copy
            main_content = soup.find('div', class_='entry-content')
            if main_content:
                # Remove tweets
                tweet_iframes = main_content.find_all('iframe', class_='bnn-if-tweet')
                for tw in tweet_iframes:
                    tw.decompose()
                # Remove images and captions
                image_captions = main_content.find_all("div", class_="wp-caption aligncenter")
                for div in image_captions:
                    div.decompose()
                # Remove reporter promo paragraph
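                # Match "follow ... <platform>"; "x" must be a standalone word so that
                # ordinary words containing the letter (e.g. "next") do not match
                follow_pattern = re.compile(
                    r'\bfollow\b.*?(facebook|twitter|instagram|truth\s*social|\bx\b|@[a-z0-9_.-]+|email)',
                    re.IGNORECASE
                )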
                all_paras = main_content.find_all("p")
                for p in all_paras:
                    para_text = p.get_text(strip=True)
                    if follow_pattern.search(para_text):
                        p.decompose()
                    elif "reporter for Breitbart News" in para_text:
                        p.decompose()
                    elif "Breitbart News Daily airs on SiriusXM" in para_text:
                        p.decompose()
                    elif "Order your copy today" in para_text:
                        p.decompose()

                # Extract text
                raw_text = main_content.get_text(separator=" ", strip=True)
                # Replace \xa0 with normal space
                clean_text = raw_text.replace("\xa0", " ")
                
                article_data["text"] = clean_text
            else:
                article_data["text"] = "Article body not found"
        
        else:
            print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")

    except Exception as e:
        print(f"An error occurred: {e}")

    return article_data


urls= ['https://www.breitbart.com/2024-election/2025/01/19/trump-makes-triumphant-crowd-entrance-at-inaugural-eve-maga-rally/',
       'https://www.breitbart.com/politics/2025/01/19/supporters-gather-for-trumps-inauguration-eve-victory-rally/',
       'https://www.breitbart.com/law-and-order/2025/01/19/biden-grants-posthumous-pardon-to-black-nationalist-marcus-garvey/',
       'https://www.breitbart.com/politics/2025/01/19/trump-swear-in-with-personal-bible-and-lincoln-bible-during-inauguration/',
       'https://www.breitbart.com/politics/2025/01/19/poll-most-americans-one-word-summary-president-joe-bidens-legacy-nothing/',
       'https://www.breitbart.com/politics/2025/01/19/exclusive-tim-scott-releases-video-highlighting-american-renewal-ahead-trump-inauguration/',
       'https://www.breitbart.com/tech/2025/01/19/tiktok-restores-services-in-u-s-after-trump-says-he-will-issue-executive-order-delaying-ban/',
       'https://www.breitbart.com/politics/2025/01/19/jon-voight-matt-boyle-james-okeefe-tribute-andrew-breitbart-patriot-awards/',
       'https://www.breitbart.com/politics/2025/01/19/poll-most-americans-have-negative-view-outgoing-president-joe-bidens-time-office/',
       'https://www.breitbart.com/entertainment/2025/01/19/actor-cnn-show-host-michael-ian-black-wants-to-fuing-impeach-trump-before-hes-sworn-in/',
       'https://www.breitbart.com/entertainment/2025/01/19/michael-rapaport-celebrates-the-demise-of-dirty-biased-damn-near-soft-porn-dumphole-tiktok/',
       'https://www.breitbart.com/europe/2025/01/19/german-ambassador-warns-trump-administration-will-seek-redefinition-of-constitutional-order/',
       'https://www.breitbart.com/entertainment/2025/01/19/dave-chappelle-wishes-trump-the-best-in-snl-monologue-the-whole-world-is-counting-on-you/',
       'https://www.breitbart.com/tech/2025/01/19/google-defies-eus-fact-checking-requirements-for-search-and-youtube/',
       'https://www.breitbart.com/2nd-amendment/2025/01/19/six-people-shot-in-six-hours-in-brandon-johnsons-chicago/',
       'https://www.breitbart.com/politics/2025/01/19/nyt-l-a-firefighters-took-20-minutes-to-respond-to-palisades-fire/',
       'https://www.breitbart.com/europe/2025/01/19/millionaires-flee-uk-in-record-numbers-since-left-wing-labours-election-win/',
       'https://www.breitbart.com/europe/2025/01/19/london-mayor-sadiq-khan-warns-of-resurgent-fascism-with-trumps-return-to-power/',
       'https://www.breitbart.com/europe/2025/01/19/uk-police-accused-of-political-correctness-as-failure-to-record-ethnicity-of-criminals-soars/',
       'https://www.breitbart.com/tech/2025/01/19/tiktok-goes-dark-for-americans-hours-before-ban-takes-place/',
       'https://www.breitbart.com/politics/2025/01/18/virginia-gov-glenn-youngkin-orders-flags-to-be-flown-at-full-staff-for-trumps-inauguration/',
       'https://www.breitbart.com/politics/2025/01/18/adam-schiff-re-hires-former-radical-pro-palestinian-staffer-maher-bitar/',
       'https://www.breitbart.com/politics/2025/01/18/glenn-youngkin-slams-bidens-clemency-of-men-who-murdered-va-police-officer-beyond-outraged/',
       'https://www.breitbart.com/health/2025/01/18/hhs-defunds-ecohealth-dr-peter-daszak-for-facilitating-dangerous-wuhan-gain-of-function-research/',
       'https://www.breitbart.com/local/2025/01/18/georgia-black-lives-matter-mayor-under-fire-for-spending-thousands-of-taxpayer-funds-on-africa-trip/',
       'https://www.breitbart.com/politics/2025/01/18/former-intel-officials-director-national-intelligence-nominee-tulsi-gabbard-put-party-loyalty-far-behind-duty-to-country/',
       'https://www.breitbart.com/immigration/2025/01/18/poll-shows-post-election-crash-in-public-tolerance-for-illegal-migration/',
       'https://www.breitbart.com/politics/2025/01/18/https-www-breitbart-com-politics-2025-01-18-poll-plurality-either-enthusiastic-or-satisfied-with-next-4-years-of-trump/',
       'https://www.breitbart.com/politics/2025/01/18/exclusive-kentucky-agriculture-commissioner-swift-confirmation-usda-nominee-brooke-rollins-will-help-make-america-healthy-again/',
       'https://www.breitbart.com/politics/2025/01/18/nyt-exposes-six-people-responsible-trying-cover-up-joe-bidens-health/',
       'https://www.breitbart.com/the-media/2025/01/18/wapo-admits-trump-already-conquered-washington-before-inauguration/',
       'https://www.breitbart.com/entertainment/2025/01/18/snoop-dogg-rick-ross-attend-pre-inauguration-crypto-ball-both-fantasized-about-killing-trump/',
       'https://www.breitbart.com/politics/2025/01/18/brad-sherman-criticizes-pacific-palisades-residents-who-returned-to-their-homes/',
       'https://www.breitbart.com/middle-east/2025/01/18/israel-will-release-mass-murderers-terrorists-in-hostage-deal/',
       'https://www.breitbart.com/2nd-amendment/2025/01/18/aggressive-driver-shot-numerous-times-after-allegedly-chasing-couple/',
       'https://www.breitbart.com/politics/2025/01/18/exclusive-sen-ron-johnson-most-if-not-all-of-trumps-nominees-will-get-at-least-50-votes/',
       'https://www.breitbart.com/politics/2025/01/18/poll-plurality-either-enthusiastic-or-satisfied-with-next-4-years-of-trump/',
       'https://www.breitbart.com/politics/2025/01/18/poll-border-and-immigration-handling-identified-as-bidens-biggest-failure/',
       'https://www.breitbart.com/europe/2025/01/18/outrage-over-short-sentences-for-grooming-gang-child-rapists/',
       'https://www.breitbart.com/politics/2025/01/18/shark-tank-investor-kevin-oleary-offers-20-billion-cash-buy-chinas-tiktok/',
       'https://www.breitbart.com/2nd-amendment/2025/01/18/gop-rep-introduces-saga-act-if-you-like-your-ar-15-you-can-keep-your-ar-15/',
       'https://www.breitbart.com/europe/2025/01/18/geert-wilders-for-breitbart-unlike-in-trumps-america-euro-bigwigs-censor-to-protect-democracy-from-voters/',
       'https://www.breitbart.com/border/2025/01/18/maine-admits-licensing-transnational-criminal-organizations-to-grow-marijuana/',
       'https://www.breitbart.com/border/2025/01/18/graphic-cartel-gunmen-behead-rivals-while-government-officials-claim-crime-is-dramatically-decreasing/',
       'https://www.breitbart.com/europe/2025/01/18/ukrainian-police-raid-draft-dodgers-in-nationwide-conscription-searches/',
       'https://www.breitbart.com/health/2025/01/17/w-h-o-demands-world-1-5-billion-emergencies/',
       'https://www.breitbart.com/politics/2025/01/17/exclusive-sen-tim-sheehy-ushers-in-bold-new-era-of-maga-leadership-this-country-was-not-founded-by-65-year-old-bureaucrats/',
       'https://www.breitbart.com/politics/2025/01/17/exclusive-no-excuse-former-aerial-firefighter-sen-tim-sheehy-details-what-ca-got-wrong-ahead-of-wildfire-disaster/',
       'https://www.breitbart.com/politics/2025/01/17/elizabeth-warren-accuses-longtime-democrat-donor-sam-altman-of-seeking-favors-from-trump-after-1-million-inaugural-donation/',
       'https://www.breitbart.com/politics/2025/01/17/first-female-los-angeles-fire-chief-facing-calls-resign-blazes-rage/',
      ]


# Scrape articles and create a DataFrame
bb_data_df = scrape_multiple_articles(urls, scrape_bb_article)
# Store to CSV
bb_data_df.to_csv("polarised_scraped_articles_bb.csv", index=False)
# Print head 
bb_data_df.head()

Unnamed: 0,title,text,site,date,category,class,url
0,Trump Makes Triumphant Crowd Entrance at Inaugural Eve MAGA Rally,"President-elect Donald Trump made a triumphant entrance through the crowd at his Make America Great Again rally less than 24 hours before he is once again sworn in as president. Lee Greenwood delivered a live rendition of his iconic song, “God Bl...",Breitbart,2025-01-20T00:26:46+00:00,Politics,Polarised,https://www.breitbart.com/2024-election/2025/01/19/trump-makes-triumphant-crowd-entrance-at-inaugural-eve-maga-rally/
1,Supporters Gather for Trump's Inauguration Eve 'Victory Rally',"WASHINGTON, DC — Supporters of President-elect Donald Trump gathered from across the country on Sunday to attend the 45th and soon-to-be 47th president’s Inauguration Eve “victory rally” at Capital One Arena in Washington, DC, braving cold weathe...",Breitbart,2025-01-20T00:08:14+00:00,Politics,Polarised,https://www.breitbart.com/politics/2025/01/19/supporters-gather-for-trumps-inauguration-eve-victory-rally/
2,Biden Grants Posthumous Pardon to Black Nationalist Marcus Garvey,"In his final full day in office, President Joe Biden has given a posthumous pardon to black nationalist Marcus Garvey, who was convicted of mail fraud in the 1920s. Referring to the pan-Africanist and racial separatist as a “renowned civil rights...",Breitbart,2025-01-19T23:42:57+00:00,Politics,Polarised,https://www.breitbart.com/law-and-order/2025/01/19/biden-grants-posthumous-pardon-to-black-nationalist-marcus-garvey/
3,Trump to Swear In with Personal Bible and Lincoln Bible During Inauguration,"President-elect Donald Trump will swear into office with two Bibles — one given to him by his mother, and one that President Abraham Lincoln used in his own inaugural ceremony in 1861. The incoming president’s personal Bible was gifted to him by ...",Breitbart,2025-01-19T23:23:34+00:00,Politics,Polarised,https://www.breitbart.com/politics/2025/01/19/trump-swear-in-with-personal-bible-and-lincoln-bible-during-inauguration/
4,Poll: Most Americans One-Word Summary of Biden's Legacy Is 'Nothing',"Americans who participated in a recent poll appear to view President Joe Biden’s (D) time in the White House as a big nothingburger. A poll conducted for the Daily Mail by J.L. Partners with some 1,009 registered voters asked how they would descr...",Breitbart,2025-01-19T22:51:12+00:00,Politics,Polarised,https://www.breitbart.com/politics/2025/01/19/poll-most-americans-one-word-summary-president-joe-bidens-legacy-nothing/
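
The Breitbart scraper removes reporter promo paragraphs with a "follow ..." regular expression plus a few fixed phrases. A minimal sketch of that filter as a standalone predicate is given here so it can be tried on sample strings. It matches `x` only as a standalone word so that words such as "next" do not trigger the filter. The names `FOLLOW_PATTERN`, `PROMO_PHRASES` and `is_promo_paragraph` are hypothetical, and the sample sentences are invented.

```python
import re

# "follow" followed later by a platform name, a handle, or "email"
FOLLOW_PATTERN = re.compile(
    r"\bfollow\b.*?(facebook|twitter|instagram|truth\s*social|\bx\b|@[a-z0-9_.-]+|email)",
    re.IGNORECASE,
)

# Fixed phrases taken from the scraper's elif chain
PROMO_PHRASES = (
    "reporter for Breitbart News",
    "Breitbart News Daily airs on SiriusXM",
    "Order your copy today",
)


def is_promo_paragraph(text):
    """Return True if the paragraph looks like reporter or site promotion."""
    if FOLLOW_PATTERN.search(text):
        return True
    return any(phrase in text for phrase in PROMO_PHRASES)


print(is_promo_paragraph("Follow him on X and Facebook."))
print(is_promo_paragraph("Experts expect tax changes to follow next year."))
```

Testing the predicate on a handful of real paragraphs before a full scrape is a cheap way to check that body text is not being dropped by an over-broad pattern.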


**Daily Kos**

Articles were scraped from the Staff Stories section (https://www.dailykos.com/history/list/staff?pm_source=main_nav_dropdown&pm_medium=web), sorted in reverse chronological order. Only articles by "Daily Kos staff" were selected, and "The Recap" posts were skipped.
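
When a Daily Kos page has no `article:published_time` meta tag, the scraper falls back to a `data-epoch-time` attribute given in milliseconds. A minimal sketch of that conversion follows. It is an alternative formulation rather than the notebook's code: it assumes a timezone-aware conversion and ISO 8601 output, which gives the same shape as the dates returned by the meta tag on the other sites. The function name `epoch_ms_to_iso` is hypothetical.

```python
from datetime import datetime, timezone


def epoch_ms_to_iso(epoch_ms):
    """Convert a millisecond epoch value (str or int) to an ISO 8601 UTC string."""
    seconds = int(epoch_ms) / 1000  # milliseconds to seconds
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.isoformat(timespec="seconds")


print(epoch_ms_to_iso("1737417600000"))  # 2025-01-21T00:00:00+00:00
```

Keeping every source in one date format avoids a separate normalisation step when the per-site CSV files are later combined.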

In [7]:
def scrape_kos_article(url):
    """
    Scrapes an article from a given URL on https://www.dailykos.com and extracts relevant information.
    
    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Polarised",
        "url": url
    }

    try:
        response = requests.get(url, timeout=30)  # avoid hanging indefinitely on an unresponsive server
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')

            # Title
            title_meta = soup.find('meta', property='og:title')
            article_data["title"] = title_meta['content'] if title_meta else "Title not found"
            
            # URL
            url_meta = soup.find('meta', property='og:url')
            article_data["url"] = url_meta['content'] if url_meta else url
            
            # Site name
            site_name_meta = soup.find('meta', property='og:site_name')
            article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
            
            # Published date
            published_date_meta = soup.find('meta', property='article:published_time')
            if not published_date_meta:
                # Fallback to noscript timestamp
                timestamp_span = soup.select_one(".story__timestamp span.timestamp")
                if timestamp_span and 'data-epoch-time' in timestamp_span.attrs:
                    # Convert timestamp to human-readable date
                    epoch_time = int(timestamp_span['data-epoch-time']) / 1000  # Convert milliseconds to seconds
                    human_readable_date = datetime.utcfromtimestamp(epoch_time).strftime('%Y-%m-%d %H:%M:%S')
                    article_data["date"] = human_readable_date
                else:
                    article_data["date"] = "Published date not found"
            else:
                article_data["date"] = published_date_meta['content']
                
            # Category
            category_meta = soup.find('meta', property='article:section')
            article_data["category"] = category_meta['content'] if category_meta else "Category not found"

            # Article text
            story_content_divs = [
                div for div in soup.find_all('div', class_='story__text')
                if 'placeholder' not in div.get('class', [])
            ]
            
            if story_content_divs:
                paragraphs = []
                exclusion_phrases = [
                    "Donate now to support",
                    "Join us on Bluesky", "Bluesky Starter Pack", "staff accounts on Bluesky", "Daily Kos is on Bluesky",
                    "Your reader support means everything", "please donate just $3", 
                    "value having free and reliable access", "Daily Kos is supported by readers like you.", "Can you chip in today?"
                ]
                
                for div in story_content_divs:
                    for p in div.find_all('p', recursive=False):  # Direct <p> children only
                        text = p.get_text(strip=True)
                        # Skip empty paragraphs and donation/social-media promos
                        if text and not any(phrase in text for phrase in exclusion_phrases):
                            paragraphs.append(text)
                
                article_data["text"] = ' '.join(paragraphs)

        
        else:
            print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")

    except Exception as e:
        print(f"An error occurred: {e}")

    return article_data


urls = [
    'https://www.dailykos.com/stories/2025/1/21/2298232/-How-the-right-is-justifying-and-even-celebrating-Musk-s-Nazi-salute',
    'https://www.dailykos.com/stories/2025/1/21/2298193/-Trump-immediately-reminds-America-of-his-pettiness-and-fragile-ego',
    'https://www.dailykos.com/stories/2025/1/21/2297340/-Why-the-internet-is-blaming-a-billionaire-couple-for-California-fires',
    'https://www.dailykos.com/stories/2025/1/21/2298201/-Mainstream-media-fails-the-Trump-test-on-his-first-day-back',
    'https://www.dailykos.com/stories/2025/1/21/2298037/-Trump-s-speech-raises-the-old-question-Is-he-evil-or-merely-stupid',
    'https://www.dailykos.com/stories/2025/1/21/2298176/-Trump-hit-with-barrage-of-lawsuits-over-his-lawless-Day-1-actions',
    'https://www.dailykos.com/stories/2025/1/21/2298174/-Law-and-order-Trump-slammed-for-freeing-terrorists-and-Nazi-sympathizers',
    'https://www.dailykos.com/stories/2025/1/21/2297105/-Why-is-Trump-stockpiling-so-much-cash',
    'https://www.dailykos.com/stories/2025/1/20/2298007/-9-Trump-inauguration-promises-that-won-t-lower-the-price-of-your-eggs',
    'https://www.dailykos.com/stories/2025/1/20/2297968/-DOGE-makes-its-first-cut-as-co-chair-Ramaswamy-reportedly-dumped',
    'https://www.dailykos.com/stories/2025/1/19/2297155/-Meet-the-unlikely-Democrats-who-could-revive-the-weary-party',
    'https://www.dailykos.com/stories/2025/1/19/2297351/-What-migration-to-red-states-could-mean-for-the-Democratic-Party',
    'https://www.dailykos.com/stories/2025/1/18/2297485/-Explaining-the-Right-The-imaginary-scourge-of-noncitizen-voting',
    'https://www.dailykos.com/stories/2025/1/18/2297478/-Here-s-the-moment-Mark-Zuckerberg-gave-away-the-game',
    'https://www.dailykos.com/stories/2025/1/18/2297143/-Trump-and-his-goons-are-already-bailing-on-mass-deportation-pledge',
    'https://www.dailykos.com/stories/2025/1/17/2297520/-Trump-s-tiny-inauguration-is-making-MAGA-morons-lose-their-minds',
    'https://www.dailykos.com/stories/2025/1/17/2297500/-Democrats-demand-tech-bros-explain-cash-being-tossed-at-Trump',
    'https://www.dailykos.com/stories/2025/1/17/2297463/-Trump-s-family-wastes-no-time-profiting-off-of-Daddy-s-second-term',
    'https://www.dailykos.com/stories/2025/1/17/2297474/-Trump-loving-tech-bro-reportedly-set-to-run-for-Ohio-governor',
    'https://www.dailykos.com/stories/2025/1/17/2297462/-Dog-killer-Noem-pushes-immigration-and-wildfire-lies-at-Senate-hearing',
    'https://www.dailykos.com/stories/2025/1/17/2297447/-Biden-shows-support-for-ERA-but-does-that-actually-mean-anything',
    'https://www.dailykos.com/stories/2025/1/17/2297437/-Trump-inauguration-will-be-too-frigid-even-for-that-coldhearted-man',
    'https://www.dailykos.com/stories/2025/1/17/2297440/-CNN-reportedly-hiding-anchor-who-dares-to-call-out-Trump-s-BS',
    'https://www.dailykos.com/stories/2025/1/17/2297414/-Chaos-among-GOP-lawmakers-threatens-Trump-s-agenda-already',
    'https://www.dailykos.com/stories/2025/1/17/2297412/-TikTok-announces-pro-Trump-bash-just-as-Supreme-Court-upholds-ban',
    'https://www.dailykos.com/stories/2025/1/17/2297266/-Democrats-prepare-to-fight-Trump-s-Day-1-executive-orders-in-the-courts',
    'https://www.dailykos.com/stories/2025/1/16/2297286/-Infamous-Jew-hating-racist-Mel-Gibson-gets-special-job-from-Trump',
    'https://www.dailykos.com/stories/2025/1/16/2297298/-Even-Republicans-aren-t-happy-about-House-speaker-s-latest-cave-to-Trump',
    'https://www.dailykos.com/stories/2025/1/16/2297303/-Trump-treasury-nominee-slammed-for-opposing-minimum-wage-increase',
    'https://www.dailykos.com/stories/2025/1/16/2297294/-Rudy-Giuliani-has-another-terrible-no-good-very-bad-day',
    'https://www.dailykos.com/stories/2025/1/16/2297251/-Another-MAGA-loyalist-eyes-Florida-governor-s-race',
    'https://www.dailykos.com/stories/2025/1/16/2297290/-Democrats-have-one-weird-trick-to-release-the-rest-of-Jack-Smith-s-report',
    'https://www.dailykos.com/stories/2025/1/16/2297252/-Seth-Meyers-fires-back-after-Trump-called-him-marble-mouth',
    'https://www.dailykos.com/stories/2025/1/16/2297278/-Nancy-Pelosi-will-skip-Trump-s-inauguration',
    'https://www.dailykos.com/stories/2025/1/16/2297254/-House-Republicans-take-first-step-in-mass-deportation-scheme',
    'https://www.dailykos.com/stories/2025/1/16/2297235/-Trump-mimics-mugshot-in-official-presidential-portrait',
    'https://www.dailykos.com/stories/2025/1/16/2297233/-Trump-has-a-list-of-enemies-and-he-wants-everyone-to-see-it',
    'https://www.dailykos.com/stories/2025/1/16/2297228/-House-speaker-demotes-Ohio-Republican-who-won-t-bend-the-knee-to-Trump',
    'https://www.dailykos.com/stories/2025/1/15/2297102/-Trump-s-attorney-general-pick-won-t-admit-Biden-won-2020-election',
    'https://www.dailykos.com/stories/2025/1/15/2297139/-Trump-tries-to-take-full-credit-for-Israel-Hamas-ceasefire-deal',
    'https://www.dailykos.com/stories/2025/1/15/2297118/-Trump-pushes-DOGE-dork-to-fill-Vance-s-vacant-Senate-seat-in-Ohio',
    'https://www.dailykos.com/stories/2025/1/15/2297079/-Even-Fox-News-can-t-spin-how-much-Greenlanders-don-t-want-to-join-US',
    'https://www.dailykos.com/stories/2025/1/15/2297088/-Pam-Bondi-is-still-covering-up-Trump-s-Georgia-election-fraud-efforts',
    'https://www.dailykos.com/stories/2025/1/15/2297068/-Trump-won-the-popular-vote-but-that-doesn-t-mean-Americans-like-him',
    'https://www.dailykos.com/stories/2025/1/15/2297080/-Fox-News-host-celebrates-lack-of-diverse-inauguration-performers',
    'https://www.dailykos.com/stories/2025/1/15/2296959/-Biden-spends-final-days-in-office-easing-the-housing-crisis',
    'https://www.dailykos.com/stories/2025/1/15/2297074/-GOP-senator-ditches-skepticism-and-backs-Pete-Hegseth-to-lead-Pentagon',
    'https://www.dailykos.com/stories/2025/1/15/2297061/-Notorious-GOP-bigot-tries-to-pick-literal-fight-with-Texas-Democrat',
    'https://www.dailykos.com/stories/2025/1/15/2296980/-Trump-s-tech-sugar-daddies-bestowed-VIP-seats-for-Inauguration-Day',
      ]


# Scrape articles and create a DataFrame
kos_data_df = scrape_multiple_articles(urls, scrape_kos_article)
# Store to CSV
kos_data_df.to_csv("polarised_scraped_articles_kos.csv", index=False)
# Print head 
kos_data_df.head()

Unnamed: 0,title,text,site,date,category,class,url
0,How the right is justifying—and even celebrating—Musk’s Nazi salute,"Feeling emboldened by Donald Trump officially being sworn in as president, the richest man in the world, Elon Musk, gave a nod to Germany’s Nazi Party—you know, the one thatkilled six millionJewish people and started World War II. “Thank you for ...",Daily Kos,2025-01-21 22:00:11,Category not found,Polarised,https://www.dailykos.com/stories/2025/1/21/2298232/-How-the-right-is-justifying-and-even-celebrating-Musk-s-Nazi-salute
1,Trump immediately reminds America of his pettiness and fragile ego,"Less than 24 hours into his second term, Donald Trump is already trying to settle scores with his perceived enemies, taking multiple petty actions to stroke his giant and fragile ego. Less than two hours after being sworn in, Trump had a portrait...",Daily Kos,2025-01-21 21:00:08,Category not found,Polarised,https://www.dailykos.com/stories/2025/1/21/2298193/-Trump-immediately-reminds-America-of-his-pettiness-and-fragile-ego
2,Why the internet is blaming a billionaire couple for California fires,"Online creators have been buzzing aboutStewart and Lynda Resnick, billionaires who own a portion of Southern California’s water supply, claiming that the couple is to blame for fire hydrantsrunning dryamid the ongoing wildfires across the region....",Daily Kos,2025-01-21 20:00:22,Category not found,Polarised,https://www.dailykos.com/stories/2025/1/21/2297340/-Why-the-internet-is-blaming-a-billionaire-couple-for-California-fires
3,'Great president': Mainstream media slobbers all over Trump,"Early in Donald Trump’s new presidency, mainstream media outlets confirmed fears of how they would cover him—by avoiding the truth, equivocating on his abuse of power, and even praising him. The actions by widely read and watched outlets was the ...",Daily Kos,2025-01-21 18:00:12,Category not found,Polarised,https://www.dailykos.com/stories/2025/1/21/2298201/-Mainstream-media-fails-the-Trump-test-on-his-first-day-back
4,Trump's speech raises the old question: Is he evil or merely stupid?,"In Donald Trump’s new conceit, America was a hellhole in 2016, he made it great, then former President Joe Biden broke everything, but thanks to Trump’s 2024 election win, everything will be great again. Or, as he said inhis inaugural addresson M...",Daily Kos,2025-01-21 17:00:09,Category not found,Polarised,https://www.dailykos.com/stories/2025/1/21/2298037/-Trump-s-speech-raises-the-old-question-Is-he-evil-or-merely-stupid


In [8]:
# Combine DataFrames
polarised_dataset = pd.concat(
    [tcw_data_df, can_data_df, bb_data_df, kos_data_df],
    ignore_index=True
)

# Basic checks
print(polarised_dataset.info())   # Data types & non-null counts
print(polarised_dataset.head())   # Quick glance at first rows

# Print out the categories
print(polarised_dataset["category"].value_counts())

# Confirm 4 sites are represented
print("Number of unique sites:", polarised_dataset["site"].nunique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     200 non-null    object
 1   text      200 non-null    object
 2   site      200 non-null    object
 3   date      200 non-null    object
 4   category  200 non-null    object
 5   class     200 non-null    object
 6   url       200 non-null    object
dtypes: object(7)
memory usage: 11.1+ KB
None
                                                                  title  \
0                      The UK grooming gang scandal is a Galileo moment   
1                The progressives' contempt for the white working class   
2  How dare Starmer reject a public inquiry into Muslim grooming gangs?   
3                               Welcome to the year of the 'Quad-demic'   
4                  How Labour is seizing more control over our children   

                                                         

In [9]:
# Store to CSV
polarised_dataset.to_csv("polarised_articles.csv", index=False)

### 3. Satire
Satirical content is intended to entertain or provoke thought through humour, exaggeration, or irony. It is often mistaken for factual reporting.

##### Features:

- Humorous or Exaggerated Tone: Content is typically marked by wit, parody, or absurdity.
- Intentional Ridiculousness: The story is meant to be funny, not factual; outlandish claims serve comedic purposes.

##### Label If:

- The piece’s goal is clearly comedic or parodic, rather than deceptive.
- The tone, language, or disclaimers indicate it’s intentionally satirical.

##### Do Not Label If:

- The piece uses humour but is still intended to mislead (label as Fabricated Content).
- The piece is comedic but still pushing a heavily skewed narrative as if it’s true (label as Polarised Content).

##### Sources:
- The Onion (USA - 55 articles)
- Babylon Bee (USA - 50 articles)
- The Daily Squib (UK - 45 articles)
- Waterford Whispers (IE - 50 articles)
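
The per-source quotas listed above can be checked against the 200-article target set for each class earlier in this notebook. The snippet below is a minimal sketch of that arithmetic; the names `satire_quota`, `TARGET_PER_CLASS`, and `check_quota` are illustrative and do not appear elsewhere in the notebook.

```python
# Planned article counts per satire source, copied from the list above.
satire_quota = {
    "The Onion": 55,
    "Babylon Bee": 50,
    "The Daily Squib": 45,
    "Waterford Whispers": 50,
}

# Per-class target stated in the "Custom dataset generation" section.
TARGET_PER_CLASS = 200


def check_quota(quota, target):
    """Return the total planned count and the shortfall against the target."""
    total = sum(quota.values())
    return total, target - total


total, shortfall = check_quota(satire_quota, TARGET_PER_CLASS)
print(f"Planned satire articles: {total} (shortfall: {shortfall})")
```

A non-zero shortfall would indicate that one of the source lists needs more (or fewer) URLs before the class is balanced.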


**The Onion**

The articles scraped are those featured in the 2024 "Annual Year" post (https://theonion.com/our-annual-year-2024/). The top 5 from each month were chosen, with image posts excluded as per the project scope. This gives a total of 55 articles, because December is not included in the roundup.
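
Because the URL list is assembled by hand (5 articles for each of 11 months, so 55 in total), it may be worth checking its length and looking for accidental duplicates before scraping. The helper below is a hedged sketch of such a check; `validate_url_list` is a hypothetical name, and the sample URLs are placeholders rather than real articles.

```python
from collections import Counter


def validate_url_list(urls, expected_count):
    """Return (is_valid, duplicates) for a hand-built list of article URLs."""
    # Strip trailing slashes so the same article is not counted twice.
    normalised = [u.rstrip("/") for u in urls]
    duplicates = [u for u, n in Counter(normalised).items() if n > 1]
    is_valid = len(normalised) == expected_count and not duplicates
    return is_valid, duplicates


# Placeholder URLs for illustration only.
sample_urls = [
    "https://example.com/a/",
    "https://example.com/b/",
    "https://example.com/a",
]
ok, dupes = validate_url_list(sample_urls, expected_count=3)
print(ok, dupes)
```

For the list in the next cell, the call would be `validate_url_list(urls, expected_count=55)`.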

In [None]:
def scrape_onion_article(url):
    """
    Scrapes an article from a given URL on theonion.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Satire", #satire is hardcoded here as we know TheOnion is a satire site
        "url": url
    }

    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Title
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Published date
        published_date_meta = soup.find('meta', property='article:published_time')
        article_data["date"] = published_date_meta['content'] if published_date_meta else "Published date not found"
        
        # Category
        category_element = soup.find('div', class_='taxonomy-category')
        category_link = category_element.find('a') if category_element else None
        article_data["category"] = category_link.text.strip() if category_link else "Category not found"
        
        # Article copy
        content_div = soup.find(
            "div",
            {"class": lambda x: x and "entry-content" in x and "single-post-content" in x}
        )
        if content_div:
            paragraphs = content_div.find_all("p")
            full_text = " ".join(p.get_text(strip=True) for p in paragraphs)
            article_data["text"] = full_text
        else:
            article_data["text"] = "Article text not found"
    
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data


# List of URLs to scrape
urls = [
    #January
    "https://theonion.com/biden-addresses-nation-while-hanging-from-branch-on-sid-1851106795/",
    "https://theonion.com/marriage-counselor-sides-with-hotter-spouse-1851143488/",
    "https://theonion.com/wealthy-dad-surprises-child-with-tree-house-he-can-airb-1851112919/",
    "https://theonion.com/glowing-pulsating-hair-product-takes-control-of-gavin-1851160421/",
    "https://theonion.com/gen-z-announces-julie-andrews-is-problematic-but-refuse-1851180352/",
    #February
    "https://theonion.com/mrbeast-announces-he-has-resurrected-everyone-buried-at-1851217565/",
    "https://theonion.com/introverted-cowboy-struggling-to-round-up-posse-1851226175/",
    "https://theonion.com/country-stations-refuse-to-play-beyonce-s-music-after-a-1851261135/",
    "https://theonion.com/stab-him-stab-him-you-cowards-says-terrified-kamal-1851243467/",
    "https://theonion.com/emerging-filmmaker-malia-obama-changes-surname-to-scors-1851278946/",
    #March
    "https://theonion.com/u-s-airdrops-rubble-into-gaza-1851305713/",
    "https://theonion.com/ozempic-maker-triumphantly-announces-new-drug-that-make-1851320436/",
    "https://theonion.com/study-millennial-women-forgoing-dating-apps-in-favor-o-1851338275/",
    "https://theonion.com/beyonce-reveals-new-country-album-cover-featuring-tooth-1851355991/",
    "https://theonion.com/but-dog-likes-fighting-for-money-1851352386/",
    #April
    "https://theonion.com/finance-whiz-has-over-300-in-bank-account-1851375065/",
    "https://theonion.com/sotheby-s-announces-auction-of-napkin-on-which-jeffrey-1851375213/",
    "https://theonion.com/o-j-simpson-allowed-to-remain-living-after-coffin-does-1851403804/",
    "https://theonion.com/travis-kelce-impresses-coachella-crowd-by-tossing-taylo-1851410856/",
    "https://theonion.com/biden-carried-away-by-ants-1851422363/",
    #May
    "https://theonion.com/tesla-lays-off-entire-team-behind-brakes-1851449223/",
    "https://theonion.com/drake-drops-new-track-inviting-kendrick-lamar-out-to-co-1851458534/",
    "https://theonion.com/perdue-announces-initiative-to-even-the-playing-field-b-1851423157/",
    "https://theonion.com/new-florida-law-requires-all-women-to-produce-3-healthy-1851482288/",
    "https://theonion.com/everyone-in-er-bit-off-finger-while-holding-sandwich-1851488798/",
    #June
    "https://theonion.com/cult-leader-not-even-charismatic-1851512851/",
    "https://theonion.com/embarrassed-david-attenborough-realizes-he-spent-10-min-1851512951/",
    "https://theonion.com/newest-u-s-aid-mission-just-single-powerbar-labeled-f-1851540802/",
    "https://theonion.com/report-every-place-on-earth-has-wrong-amount-of-water-1851544516/",
    "https://theonion.com/nasa-warns-space-hawk-has-swooped-in-and-picked-up-eart-1851544578/",
    #July
    "https://theonion.com/clarence-thomas-torn-over-case-where-both-sides-offer-c-1851566812/",
    "https://theonion.com/democrats-panic-after-kamala-harris-ages-40-years-in-si-1851601473/",
    "https://theonion.com/congress-bans-roofs-1851592883/",
    "https://theonion.com/news-happening-faster-than-man-can-generate-uninformed-1851601466/",
    "https://theonion.com/god-forced-to-shave-head-after-contracting-plague-of-li-1851580149/",
    #August
    "https://theonion.com/environmentalists-warn-u-s-running-out-of-small-wooded-1851609190/",
    "https://theonion.com/r-kelly-petitions-supreme-court-to-watch-him-pee-1851619802rev1723482404693/",
    "https://theonion.com/federated-union-of-bear-cub-carcass-dumpers-endorses-rf-1851613425/",
    "https://theonion.com/glen-powell-opens-up-about-dangerous-stunt-work-filming-with-sydney-sweeneys-breasts/",
    "https://theonion.com/j-d-vance-accuses-tim-walz-of-stolen-valor-for-wearing-1851621120/",
    #September
    "https://theonion.com/everyone-in-restaurant-jealous-of-toddler-who-gets-to-wear-pajamas-and-watch-ipad/",
    "https://theonion.com/horrified-taylor-swift-realizes-football-happens-every-year/",
    "https://theonion.com/trump-avoids-answering-hard-questions-by-pretending-he-shot-in-ear-again/",
    "https://theonion.com/man-replies-stop-to-political-fundraiser-text-like-powerful-wizard-casting-spell-to-ward-off-mythical-beast/",
    "https://theonion.com/scarecrow-has-double-ds/",
    #October
    "https://theonion.com/the-onion-officially-endorses-joe-biden-for-president/",
    "https://theonion.com/texas-sex-ed-class-teaches-boys-how-to-cheat-on-pregnant-wife/",
    "https://theonion.com/sabrina-carpenter-completes-mandatory-service-in-south-korean-military/",
    "https://theonion.com/north-carolina-family-informed-their-insurance-policy-voided-once-house-gets-wet/",
    "https://theonion.com/grandma-who-survived-great-depression-casually-drops-that-she-once-killed-man-for-mayonnaise/",
    #November
    "https://theonion.com/piss-soaked-tucker-carlson-claims-demon-urinated-on-him-while-he-slept/",
    "https://theonion.com/trump-calls-harris-to-congratulate-himself-on-winning/",
    "https://theonion.com/america-defeats-america/",
    "https://theonion.com/man-forgetting-difference-between-meteoroid-meteorite-struggles-to-describe-what-just-killed-his-dog/",
    "https://theonion.com/every-movement-in-mans-burrito-eating-technique-informed-by-past-burrito-tragedies/"
]

# Scrape articles and create a DataFrame
onion_data_df = scrape_multiple_articles(urls, scrape_onion_article)
# Store to CSV
onion_data_df.to_csv("satire_scraped_articles_onion.csv", index=False)
# Print head 
onion_data_df.head()

**Babylon Bee**

The top 50 articles from the Greatest Hits page (https://babylonbee.com/news?sort=greatest-hits) have been scraped. The categories "Christian Living" and "Scripture" were excluded for being too niche. 
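
The category exclusion described above could also be verified after scraping. The sketch below filters a list of article dictionaries shaped like the ones the scraper functions return. It assumes the scraped `category` strings match the names given above exactly (apart from case and surrounding whitespace); `EXCLUDED_CATEGORIES` and `drop_excluded` are illustrative names, not part of the notebook.

```python
# Categories named above as out of scope (compared case-insensitively).
EXCLUDED_CATEGORIES = {"christian living", "scripture"}


def drop_excluded(articles, excluded=EXCLUDED_CATEGORIES):
    """Keep only article dicts whose category is not in the excluded set."""
    return [
        a for a in articles
        if a.get("category", "").strip().lower() not in excluded
    ]


# Toy records for illustration; real records come from the scraper.
sample_articles = [
    {"title": "A", "category": "Politics"},
    {"title": "B", "category": "Scripture"},
    {"title": "C", "category": "Christian Living "},
]
kept = drop_excluded(sample_articles)
print([a["title"] for a in kept])
```

If any rows are dropped by this check, a replacement URL would need to be added to keep the source at 50 articles.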


In [None]:
urls = [
    "https://babylonbee.com/news/trump-i-have-done-more-for-christianity-than-jesus",
    "https://babylonbee.com/news/senate-to-be-replaced-with-room-full-of-monkeys-throwing-feces",
    "https://babylonbee.com/news/motorcycle-that-identifies-as-bicycle-sets-world-cycling-record",
    "https://babylonbee.com/news/trumps-says-5-golden-tickets-to-be-hidden-among-stimulus-checks",
    "https://babylonbee.com/news/nfl-to-adorn-all-uniforms-with-lace-doilies-in-to-honor-rbg",
    "https://babylonbee.com/news/pelosi-rips-up-bible",
    "https://babylonbee.com/news/biden-cuts-holes-in-medical-mask-so-he-can-still-sniff-people",
    "https://babylonbee.com/news/man-identifying-6-year-old-crushes-game-winning-homer-tee-ball-championship",
    "https://babylonbee.com/news/biden-i-am-the-only-candidate-who-can-beat-ronald-reagan",
    "https://babylonbee.com/news/fisher-price-introduces-supreme-court-protest-playhouse-that-can-be-vandalized-and-burned-down",
    "https://babylonbee.com/news/cracker-jacks-changes-name-to-more-politically-correct-caucasian-jacks",
    "https://babylonbee.com/news/cdc-people-dirt-clintons-843-greater-risk-suicide",
    "https://babylonbee.com/news/walmart-requiring-all-shoppers-to-wear-pants",
    "https://babylonbee.com/news/ilhan-omar-withdraws-support-from-bill-to-save-the-earth-after-learning-thats-where-israel-is",
    "https://babylonbee.com/news/inspiring-celebrities-spell-out-were-all-in-this-together-with-their-yachts",
    "https://babylonbee.com/news/democrats-warn-that-american-people-may-tamper-with-next-election",
    "https://babylonbee.com/news/people-who-tweet-in-support-of-foreign-wars-to-be-automatically-enlisted-in-armed-forces",
    "https://babylonbee.com/news/bernie-sanders-praises-china-for-eradicating-poverty-by-killing-all-the-poor-people",
    "https://babylonbee.com/news/pence-cancels-general-election-to-stymie-coronavirus",
    "https://babylonbee.com/news/walmart-discontinues-sale-of-auto-parts-to-prevent-car-accidents",
    "https://babylonbee.com/news/federal-prison-hires-top-rated-italian-bodyguard-hillena-clintonelli-to-protect-ghislaine-maxwell",
    "https://babylonbee.com/news/kim-jong-un-attends-ivy-league-university-to-learn-new-brainwashing-techniques",
    "https://babylonbee.com/news/florida-recount-finally-wraps-up-al-gore-declared-president",
    "https://babylonbee.com/news/powerful-protesters-spell-out-love-with-burning-homes-and-businesses",
    "https://babylonbee.com/news/joel-osteen-tests-positive-for-heresy",
    "https://babylonbee.com/news/caravan-of-liberal-americans-makes-way-toward-socialist-paradise-of-venezuela",
    "https://babylonbee.com/news/in-genius-move-trump-supports-impeachment-forcing-democrats-to-oppose-it",
    "https://babylonbee.com/news/cnn-publishes-real-news-story-for-april-fools-day",
    "https://babylonbee.com/news/government-accidentally-shuts-itself-down-with-ban-on-non-essential-businesses",
    "https://babylonbee.com/news/wife-unaware-that-movie-will-answer-all-her-questions-if-she-just-pays-attention",
    "https://babylonbee.com/news/bernie-sanders-arrives-in-hong-kong-to-lecture-protesters-on-how-good-they-have-it-under-communism",
    "https://babylonbee.com/news/jussie-smollett-offered-job-at-cnn-after-fabricating-news-story-out-of-thin-air",
    "https://babylonbee.com/news/portland-police-wish-there-were-some-kind-of-organized-armed-force-that-could-fight-back-against-antifa",
    "https://babylonbee.com/news/to-celebrate-move-to-texas-tesla-introduces-battery-powered-ar-15",
    "https://babylonbee.com/news/genius-trump-nominates-joe-biden-to-supreme-court",
    "https://babylonbee.com/news/hillary-clinton-accidentally-posts-condolences-for-tulsi-gabbards-suicide-one-day-early",
    "https://babylonbee.com/news/twitter-shuts-down-entire-network-to-slow-spread-of-negative-biden-news",
    "https://babylonbee.com/news/celebrities-show-solidarity-with-protesters-by-burning-their-own-homes-to-the-ground",
    "https://babylonbee.com/news/lego-introduces-new-sharper-bricks-that-instantly-kill-you-when-you-step-on-them",
    "https://babylonbee.com/news/democrats-call-for-flags-to-be-flown-half-mast-to-grieve-death-of-soleimani",
    "https://babylonbee.com/news/californians-brace-for-deadly-50-degree-cold-front",
    "https://babylonbee.com/news/brilliant-trump-puts-himself-on-all-postage-stamps-forcing-democrats-to-abolish-the-usps",
    "https://babylonbee.com/news/nations-nerds-wake-up-in-utopia-where-everyone-stays-inside-sports-canceled-social-interaction-forbidden",
    "https://babylonbee.com/news/hollywood-rushes-to-make-pedophilia-acceptable-before-theyre-outed-by-ghislaine-maxwell",
    "https://babylonbee.com/news/as-part-of-settlement-with-nick-sandmann-cnn-hosts-must-wear-maga-hats-while-on-the-air",
    "https://babylonbee.com/news/biden-campaign-says-he-is-so-close-to-a-vp-pick-he-can-smell-her",
    "https://babylonbee.com/news/trump-says-to-drink-lots-of-water-media-reports-as-deranged-trump-tells-everyone-to-drown-themselves",
    "https://babylonbee.com/news/starbucks-unveils-new-satanic-holiday-cups",
    "https://babylonbee.com/news/bill-clinton-allegations-of-sexual-misconduct-should-disqualify-a-man-from-public-office",
    "https://babylonbee.com/news/joel-osteen-launches-line-pastoral-wear-sheeps-clothing"
]

def scrape_bee_article(url):
    """
    Scrapes an article from a given URL on babylonbee.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Satire", #satire is hardcoded here as we know BabylonBee is a satire site
        "url": url
    }

    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Title
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Published date       
        published_date_meta = soup.find('meta', {"name": "published_at"})
        if published_date_meta and published_date_meta.get("content"):
            article_data["date"] = published_date_meta["content"].split()[0]
        else:
            article_data["date"] = "Published date not found"
        
        # Category
        category_link = soup.find("a", href=lambda href: href and "/news/categories/" in href)
        if category_link:
            article_data["category"] = category_link.get_text(strip=True)
        else:
            article_data["category"] = "Category not found"
            
        # Article copy
        content_div = soup.find("div", class_="text-lg mt-6 leading-6 text-gray-700 article-content mx-2 sm:mx-0")
        if content_div:
            paragraphs = content_div.find_all("p")
            full_text = " ".join(p.get_text(strip=False) for p in paragraphs)
            article_data["text"] = full_text.strip()
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

# Scrape articles and create a DataFrame
bee_data_df = scrape_multiple_articles(urls, scrape_bee_article)
# Store to CSV
bee_data_df.to_csv("satire_scraped_articles_bee.csv", index=False)
# Print df 
bee_data_df.head()

**The Daily Squib**

45 articles were taken from the "Most Popular" page: https://www.dailysquib.co.uk/category/most-popular

In [None]:
urls = [
    "https://www.dailysquib.co.uk/entertainment/58255-bonus-good-newsfor-american-men-unhinged-batsht-crazy-liberal-women-going-celibate.html",
    "https://www.dailysquib.co.uk/world/56316-rachel-maddow-concerned-trump-will-put-her-in-fema-camp-during-second-term-yes-im-worried.html",
    "https://www.dailysquib.co.uk/entertainment/55825-im-a-cruise-ship-worker-these-are-the-six-things-smart-passengers-always-do-onboard.html",
    "https://www.dailysquib.co.uk/world/56106-labour-plan-to-have-speakers-and-listening-devices-on-every-lamp-post.html",
    "https://www.dailysquib.co.uk/sci_tech/55812-personal-computers-and-smartphones-were-introduced-for-benefit-of-ai.html",
    "https://www.dailysquib.co.uk/entertainment/55590-analysis-was-prince-harry-making-a-statement-in-latest-address.html",
    "https://www.dailysquib.co.uk/world/54018-interconnected-the-internet-only-creates-war-for-humanity.html",
    "https://www.dailysquib.co.uk/world/48511-world-economic-forum-brutal-totalitarian-communist-china-is-model-for-western-nations.html",
    "https://www.dailysquib.co.uk/entertainment/48277-matt-hancock-found-with-huge-amounts-of-midazolam-in-jungle.html",
    "https://www.dailysquib.co.uk/world/47729-the-beginning-of-the-post-consumerist-era.html",
    "https://www.dailysquib.co.uk/world/45624-netflix-harry-and-meghan-enjoy-themselves-exploiting-disabled-veterans-for-cash.html",
    "https://www.dailysquib.co.uk/entertainment/41634-experts-meghan-markle-thought-she-could-move-up-rank-in-royal-family.html",
    "https://www.dailysquib.co.uk/entertainment/41343-boo-hoo-you-made-me-cry.html",
    "https://www.dailysquib.co.uk/world/41322-is-harry-now-a-national-security-threat.html",
    "https://www.dailysquib.co.uk/entertainment/41208-first-transgender-woman-crowned-miss-minnesota-2021.html",
    "https://www.dailysquib.co.uk/entertainment/41200-meghan-markle-endures-bird-shit-trauma-during-oprah-interview.html",
    "https://www.dailysquib.co.uk/world/41113-meghan-markle-demands-english-county-of-sussex-is-moved-to-california.html",
    "https://www.dailysquib.co.uk/entertainment/39995-queen-meghan-and-king-harry-of-america-knight-their-gardener.html",
    "https://www.dailysquib.co.uk/world/39382-hunter-biden-i-cant-wait-to-move-into-white-house-to-smoke-crack-rocks.html",
    "https://www.dailysquib.co.uk/world/39255-obama-and-hunter-biden-sold-out-america-to-the-highest-bidder-china.html",
    "https://www.dailysquib.co.uk/world/39205-the-biden-incest-plot-thickens.html",
    "https://www.dailysquib.co.uk/entertainment/38604-keeping-up-with-the-sussexes-netflix-series-coming-in-december.html",
    "https://www.dailysquib.co.uk/world/38599-trump-to-open-presidential-library-of-authors-books-written-about-how-bad-he-is.html",
    "https://www.dailysquib.co.uk/entertainment/38512-meghan-is-imitating-diana-after-years-of-study.html",
    "https://www.dailysquib.co.uk/entertainment/38263-another-tedious-megan-markle-lecture-to-the-unwashed-masses.html",
    "https://www.dailysquib.co.uk/entertainment/38146-meghan-markle-pees-in-extensive-gardens-instead-of-using-16-bathrooms.html",
    "https://www.dailysquib.co.uk/world/38067-ironic-that-blm-antifa-heroes-karl-marx-and-engels-thought-blacks-closer-to-animal-kingdom-than-whites.html",
    "https://www.dailysquib.co.uk/world/38023-why-is-michelle-obama-so-depressed.html",
    "https://www.dailysquib.co.uk/world/37706-blm-campaign-failure-are-black-people-more-hated-now-than-before-riots.html",
    "https://www.dailysquib.co.uk/world/37612-mount-rushmore-presidents-could-be-replaced-by-blm-and-metoo-founders.html",
    "https://www.dailysquib.co.uk/world/37341-blm-doing-to-white-people-what-the-nazis-did-to-jews-dehumanizing-them.html",
    "https://www.dailysquib.co.uk/world/37294-george-floyd-to-be-sainted-by-pope-in-america.html",
    "https://www.dailysquib.co.uk/world/37236-intelligence-china-encouraging-blm-antifa-rioters-across-u-s-cities.html",
    "https://www.dailysquib.co.uk/world/36763-civilian-harry-misses-his-old-life-and-regrets-listening-to-meghan-markle.html",
    "https://www.dailysquib.co.uk/world/36586-meghan-markle-appealing-to-trump-to-end-coronavirus-pandemic-because-her-headlines-are-gone.html",
    "https://www.dailysquib.co.uk/entertainment/36508-archehole-harry-and-meghan-reveal-new-money-making-venture.html",
    "https://www.dailysquib.co.uk/world/35810-coronavirus-cui-bono-who-benefits.html",
    "https://www.dailysquib.co.uk/world/35975-meghan-reveals-harry-suffers-from-post-traumatic-royal-disorder.html",
    "https://www.dailysquib.co.uk/world/35896-will-harry-ever-forgive-meghan-for-her-crime.html",
    "https://www.dailysquib.co.uk/entertainment/35738-disney-could-replace-meghan-markle-with-plank-of-wood.html",
    "https://www.dailysquib.co.uk/entertainment/35629-defiant-meghan-markle-vs-windsor-royal-family.html",
    "https://www.dailysquib.co.uk/entertainment/35605-thomas-markle-bans-meghan-and-harry-from-using-markle-brand.html",
    "https://www.dailysquib.co.uk/world/35419-chinese-water-supply-contains-faecal-matter-aiding-spread-of-coronavirus.html",
    "https://www.dailysquib.co.uk/world/35341-chinese-authorities-misreporting-coronavirus-deaths.html",
    "https://www.dailysquib.co.uk/world/35315-remainer-tears-to-be-used-to-generate-electricity-for-britain.html",
]

def scrape_squib_article(url):
    """
    Scrapes an article from a given URL on dailysquib.co.uk and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Satire", #satire is hardcoded here as we know the Daily Squib is a satire site
        "url": url
    }

    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # title
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Published date        
        published_meta = soup.find("meta", property="article:published_time")
        if published_meta and published_meta.get("content"):
            article_data["date"] = published_meta["content"].split("T")[0]
        else:
            article_data["date"] = "Published date not found"
        
        # Category
        category_div = soup.find("div", class_="tdb-category td-fix-index")
        if category_div:
            cat_links = category_div.find_all("a", class_="tdb-entry-category")
            # ignore the "most popular" display category
            categories = [
                a.get_text(strip=True) for a in cat_links
                if a.get_text(strip=True).lower() != "most popular"
            ]
            # if multiple categories remain, keep the first; an empty list
            # (no links, or only "most popular") falls back to the placeholder
            article_data["category"] = categories[0] if categories else "Category not found"
        else:
            article_data["category"] = "Category not found"

        # Extract the article text
        content_div = soup.find("div", class_="td-post-content")
        
        if content_div:
            # remove blockquotes (e.g. embedded tweets)
            for bq in content_div.find_all("blockquote"):
                bq.decompose()
            paragraphs = content_div.find_all("p")
            full_text = " ".join(p.get_text(strip=False) for p in paragraphs)
            article_data["text"] = full_text.strip()
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data


# Scrape articles and create a DataFrame
squib_data_df = scrape_multiple_squib_articles(urls, scrape_squib_article)
# Store to CSV
squib_data_df.to_csv("satire_scraped_articles_squib.csv", index=False)
# Print df 
squib_data_df

**Waterford Whispers**

50 articles were taken from the homepage (https://waterfordwhispersnews.com/), sorted from most recent to least recent.

In [None]:
urls = [
    "https://waterfordwhispersnews.com/2025/01/06/dickhead-boss-wants-to-hit-the-ground-running-in-2025/",
    "https://waterfordwhispersnews.com/2025/01/03/organised-local-woman-straight-onto-revenue-portal-to-get-that-e4-25-tax-back-shes-owed/",
    "https://waterfordwhispersnews.com/2025/01/06/eight-additional-data-centres-needed-to-store-pictures-of-irish-snow-energy-watchdog-warns/",
    "https://waterfordwhispersnews.com/2025/01/06/colin-farrell-asks-if-three-golden-globes-are-redeemable-for-one-oscar/",
    "https://waterfordwhispersnews.com/2025/01/06/remote-workers-wouldnt-have-agreed-to-work-from-home-if-they-knew-it-meant-zero-snow-days-off/",
    "https://waterfordwhispersnews.com/2025/01/06/im-too-old-and-rich-to-be-righteous-anymore-bono/",
    "https://waterfordwhispersnews.com/2025/01/02/irish-couples-under-increasing-pressure-to-have-minimoon/",
    "https://waterfordwhispersnews.com/2025/01/01/5-realistic-new-years-resolutions/",
    "https://waterfordwhispersnews.com/2024/12/30/seeing-elder-millennials-in-news-headlines-like-a-stab-in-the-heart/",
    "https://waterfordwhispersnews.com/2024/12/26/nations-traffic-at-standstill-as-post-christmas-re-turn-machine-queues-clog-roads/",
    "https://waterfordwhispersnews.com/2024/12/24/christmas-miracle-fine-gael-fianna-fail-put-aside-differences-to-play-football-in-leinster-house-trenches/",
    "https://waterfordwhispersnews.com/2024/12/22/local-woman-cant-believe-how-many-bullshit-made-up-christmas-traditions-in-laws-have/",
    "https://waterfordwhispersnews.com/2024/12/20/how-come-you-your-dad-support-different-teams-innocently-asks-girlfriend-about-to-receive-crash-course-in-chelseas-early-2000s-transformation/",
    "https://waterfordwhispersnews.com/2024/12/19/investigation-launched-to-discover-why-sofas-no-longer-come-with-arm-rest-covers/",
    "https://waterfordwhispersnews.com/2024/12/19/tesco-expand-self-service-checkout-to-include-customer-stacking-shelves-processing-deliveries/",
    "https://waterfordwhispersnews.com/2024/12/18/fears-rip-ie-death-notice-charge-may-turn-irish-funerals-into-a-money-racket/",
    "https://waterfordwhispersnews.com/2024/12/18/martina-burke-starting-to-suspect-family-would-do-anything-to-get-away-from-her/",
    "https://waterfordwhispersnews.com/2024/12/18/people-insisting-on-posting-about-death-destruction-in-the-birthplace-of-jesus-asked-to-shut-up-so-we-can-all-enjoy-a-guilt-free-christmas/",
    "https://waterfordwhispersnews.com/2024/12/18/fresh-hope-irish-politics-changing-for-better-with-ff-fg-supporting-td-who-thinks-3-year-old-immigrants-are-in-isis-for-ceann-comhairle/",
    "https://waterfordwhispersnews.com/2024/12/18/on-this-day-1981-irelands-first-swingers-discuss-having-no-one-to-ride/",
    "https://waterfordwhispersnews.com/2024/12/17/what-irish-people-are-saying-about-israel-closing-its-embassy-in-ireland/",
    "https://waterfordwhispersnews.com/2024/12/18/4-year-old-pours-glass-of-ribena-after-another-exhausting-day/",
    "https://waterfordwhispersnews.com/2024/12/16/mcentee-reassures-public-dublin-city-safe-between-the-times-of-10-23am-10-27am-every-second-tuesday/",
    "https://waterfordwhispersnews.com/2024/12/16/former-fine-gael-politician-charged-with-human-trafficking-possessing-sex-abuse-images-some-media-half-heartedly-reports/",
    "https://waterfordwhispersnews.com/2024/12/16/thousands-of-psychologists-descend-on-manchester-city-to-observe-study-guardiolas-meltdown/",
    "https://waterfordwhispersnews.com/2024/12/16/country-that-stole-irish-passports-for-use-in-assassinations-attacked-irish-peacekeepers-closes-embassy-in-ireland-over-countrys-opposition-to-genocide/",
    "https://waterfordwhispersnews.com/2024/12/16/conor-mcgregor-wins-brought-most-shame-to-ireland-at-rte-sports-awards/",
    "https://waterfordwhispersnews.com/2024/12/13/local-mans-always-had-a-distrust-of-the-state-ever-since-they-caught-him-doing-illegal-things/",
    "https://waterfordwhispersnews.com/2024/12/13/super-low-key-girls-xmas-meet-up-somehow-costs-e427/",
    "https://waterfordwhispersnews.com/2024/12/13/pep-guardiola-calls-shamrock-rovers-for-tips-on-winning-in-europe/",
    "https://waterfordwhispersnews.com/2024/12/12/everyone-advised-to-ignore-report-saying-house-prices-overvalued-by-10-everything-is-fine/",
    "https://waterfordwhispersnews.com/2024/12/12/man-greets-suitable-for-8-people-label-on-food-like-a-challenge/",
    "https://waterfordwhispersnews.com/2024/12/12/saudi-arabia-begins-plotting-workers-mass-grave-for-world-cup-2034/",
    "https://waterfordwhispersnews.com/2024/12/11/revenue-raise-concerns-as-some-gaa-county-board-accounts-written-in-marker-on-hurl-grip-tape/",
    "https://waterfordwhispersnews.com/2024/12/11/israel-bagsies-syria/",
    "https://waterfordwhispersnews.com/2024/12/11/luigi-mangione-remains-producers-favourite-for-next-season-of-the-bachelor/",
    "https://waterfordwhispersnews.com/2024/12/09/dont-worry-were-the-good-terrorists/",
    "https://waterfordwhispersnews.com/2024/12/11/hayes-sickened-he-didnt-hold-out-as-stocks-he-dumped-soars/",
    "https://waterfordwhispersnews.com/2024/12/09/some-irish-villages-resorting-to-cannibalism-as-power-yet-to-be-restored-after-storm-darragh/",
    "https://waterfordwhispersnews.com/2024/12/06/person-at-top-of-shop-queue-taken-completely-by-surprise-by-request-to-pay-for-goods/",
    "https://waterfordwhispersnews.com/2024/12/09/assad-to-be-given-tour-of-moscows-most-beautiful-10th-storey-windows/",
    "https://waterfordwhispersnews.com/2024/12/06/local-dad-enters-2nd-hour-of-christmas-tree-price-negotiation-stand-off/",
    "https://waterfordwhispersnews.com/2024/12/05/unitedhealthcare-board-not-sure-when-right-time-to-break-it-to-ceos-family-none-of-this-is-covered-under-insurance-plan/",
    "https://waterfordwhispersnews.com/2024/12/04/parents-counteract-child-dropping-hints-about-xmas-presents-with-dropping-hints-theyre-broke-as-fuck/",
    "https://waterfordwhispersnews.com/2024/12/04/you-up-hun-ff-fg-sends-late-night-text-to-michael-lowry/",
    "https://waterfordwhispersnews.com/2024/12/04/thats-bad-form-now-kim-jong-un-urges-south-korean-president-to-stand-down/",
    "https://waterfordwhispersnews.com/2024/12/03/supplying-weapons-to-kill-50000-people-all-fine-but-just-dont-pardon-your-son-biden-told/",
    "https://waterfordwhispersnews.com/2024/12/03/first-100-violations-of-ceasefire-are-free-us-tells-israel-as-idf-continues-to-strike-lebanon/",
    "https://waterfordwhispersnews.com/2024/12/03/fianna-fail-fine-gaels-red-lines-for-going-into-government-with-each-other/",
    "https://waterfordwhispersnews.com/2024/12/03/calm-down-sugar-tits-gregg-wallaces-guide-to-crafting-the-perfect-apology/",
    
]

def scrape_whispers_article(url):
    """
    Scrapes an article from a given URL on waterfordwhispersnews.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Satire", #satire is hardcoded here as we know WaterfordWhispers is a satire site
        "url": url
    }

    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Title
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Published date 
        date_div = soup.find("div", class_="post-date", itemprop="datePublished")
        if date_div:
            article_data["date"] = date_div.get_text(strip=True)
        else:
            article_data["date"] = "Date not found"
 
        # Category (excluding the ones used just for web display)
        excluded_categories = {"breaking news", "featured-one", "featured-two", "featured-three","homepage"}
        category_div = soup.find("div", class_="post-category")
        if category_div:
            all_cats = [a.get_text(strip=True) for a in category_div.find_all("a")]
            valid_cats = [cat for cat in all_cats
                          if cat.lower() not in excluded_categories]
            if valid_cats:
                article_data["category"] = valid_cats[0]
            else:
                article_data["category"] = "Category not found"
        else:
            article_data["category"] = "Category not found"

        # Article copy
        content_div = soup.find("div", class_="article-content", itemprop="articleBody")
        if content_div:
            # remove marketing snippets
            marketing_snippets = (
                "check out our shop.",
                "www.waterfordwhispers.shop",
                "buy some of our merch here",
                "help us to keep pissing off all the right people",
            )
            for p_tag in content_div.find_all("p"):
                p_text = p_tag.get_text(strip=True).lower()
                if any(snippet in p_text for snippet in marketing_snippets):
                    p_tag.decompose()

            # remove blockquotes
            for bq in content_div.find_all("blockquote"):
                bq.decompose()

            paragraphs = content_div.find_all("p")
            full_text = " ".join(p.get_text(strip=False) for p in paragraphs)
            article_data["text"] = full_text.strip()
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

# Scrape articles and create a DataFrame
whispers_data_df = scrape_multiple_whispers_articles(urls, scrape_whispers_article)
# Store to CSV
whispers_data_df.to_csv("satire_scraped_articles_whispers.csv", index=False)
# Print df 
whispers_data_df.head()

In [None]:
# Combine DataFrames
satire_dataset = pd.concat(
    [whispers_data_df, squib_data_df, bee_data_df, onion_data_df],
    ignore_index=True
)

# Basic checks
print(satire_dataset.info())   # Data types & non-null counts
print(satire_dataset.head())   # Quick glance at first rows

# Print out the categories
print(satire_dataset["category"].value_counts())

# Confirm 4 sites are represented
print("Number of unique sites:", satire_dataset["site"].nunique())

In [None]:
def clean_category(cat: str) -> str:
    """
    Convert categories to lowercase, unify synonyms, and return a single standardised category.
    """
    # Convert to lowercase
    c = cat.strip().lower()
    
    # Standardise categories
    replacements = {
        'politics': 'politics',
        'local news': 'local',
        'world news': 'world',
        'world': 'world',
        'worldviews':'world',
        'entertainment': 'entertainment',
        'business': 'business',
        'health': 'health',
        'lifestyle': 'lifestyle',
        'life':'lifestyle',
        'sports': 'sports',
        'sport': 'sports',
        'football': 'sports',
        'gaa': 'sports',
        'sci/tech': 'technology',
        'celebs':'entertainment',
        'tech': 'technology',  # same target as 'sci/tech' so both merge into one category
        'u.s.':'united states',
        'uplifting viral content': 'entertainment', 
    }

    # Do the changes
    if c in replacements:
        c = replacements[c]
    return c

# Apply the cleaning
satire_dataset['category'] = satire_dataset['category'].apply(clean_category)

# Now check the new distribution
print(satire_dataset['category'].value_counts())

In [None]:
# Store to CSV
satire_dataset.to_csv("satire_articles.csv", index=False)
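
The scraper functions above write sentinel strings such as "Article text not found" when an element is missing, so it can be worth counting them before relying on the saved file. The following is a minimal sketch and not part of the original pipeline: `count_placeholders` is a hypothetical helper, the sample rows are invented, and treating an empty string as missing is an assumption.

```python
from collections import Counter

# Sentinel strings written by the scraper functions when an element is missing.
PLACEHOLDERS = {
    "Title not found",
    "Site name not found",
    "Date not found",
    "Category not found",
    "Article text not found",
}


def count_placeholders(rows):
    """Count, per column, how many rows hold a placeholder or empty value.

    `rows` is a list of dicts, e.g. satire_dataset.to_dict("records").
    Treating "" as missing is an assumption made for this sketch.
    """
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value == "" or value in PLACEHOLDERS:
                missing[column] += 1
    return dict(missing)


# Invented rows, for illustration only.
sample_rows = [
    {"title": "A headline", "text": "Body copy", "category": "Category not found"},
    {"title": "Title not found", "text": "Article text not found", "category": "politics"},
    {"title": "Another headline", "text": "", "category": "sport"},
]

print(count_placeholders(sample_rows))
```

Columns with a non-zero count point to pages where a selector did not match, which may be worth re-scraping or dropping before training.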

### 5. Commentary
Opinion-based content that reflects the writer’s viewpoint, often resting on personal interpretation rather than factual grounding.

##### Features:

- Personal Interpretation: The writer’s subjective opinions or experiences form the core of the content.
- Limited Fact-Checking: Minimal reliance on verified data; opinions may be framed as personal reflections or “takes.”
- Editorial or Opinion Section: Typically appears in editorial pages, op-eds, blogs, or similar formats clearly labelled as opinion.

##### Label If:

- The text is primarily an opinion piece discussing how the author feels about an event, topic, or policy.
- The author uses subjective language (e.g., “I believe…,” “In my view…”) rather than objective reporting.

##### Do Not Label If:

- The commentary deliberately misrepresents facts to persuade or manipulates partial truths (label as Polarised).
- The commentary is disguised marketing or propaganda with a clear persuasive goal (label as Persuasive).

##### Sources:
- https://www.wsws.org/en/topics/site_area/perspectives
- https://www.huffpost.com/section/opinion
- https://www.nytimes.com/international/section/opinion
- https://www.washingtonpost.com/opinions/
- https://www.theguardian.com/uk/commentisfree
- https://www.nature.com/nature/articles?type=editorial

In [None]:
#https://www.washingtonexaminer.com/opinion/columnists/3318785/complicated-story-iron-mountain/

urls = [
    'https://www.washingtonexaminer.com/opinion/columnists/3311520/trump-rides-the-vibes/',
    'https://www.washingtonexaminer.com/opinion/beltway-confidential/3311291/ignore-trumps-gaza-distraction-focus-iran/',
    'https://www.washingtonexaminer.com/restoring-america/faith-freedom-self-reliance/3311173/how-trump-and-doge-should-reform-social-security-administration/',
    'https://www.washingtonexaminer.com/opinion/beltway-confidential/3154264/universities-need-more-change-than-neutrality/',
    'https://www.washingtonexaminer.com/op-eds/3311024/rubio-must-reverse-the-biden-administrations-designations-of-us-allies/',
    'https://www.washingtonexaminer.com/opinion/editorials/3310152/rubio-good-start-blunting-chinese-influence-panama/',
    'https://www.washingtonexaminer.com/opinion/beltway-confidential/3310965/trump-endangers-american-interests-gaza-ownership-plan/',
    'https://www.washingtonexaminer.com/opinion/3310524/early-look-virginia-governor-race/',
    'https://www.washingtonexaminer.com/opinion/beltway-confidential/3310112/trump-shows-tariffs-work/',
    'https://www.washingtonexaminer.com/restoring-america/patriotism-unity/3310394/ending-plunder-grift-usaid/',
    'https://www.washingtonexaminer.com/opinion/beltway-confidential/3310324/democrats-abortion-cult-getting-more-morbid/',
    'https://www.washingtonexaminer.com/opinion/3310165/sean-parnell-once-again-answers-call-serve/',
    'https://www.washingtonexaminer.com/in_focus/3309253/trump-can-reset-relations-iran-mullah-regime/',
    'https://www.washingtonexaminer.com/opinion/3309085/trump-can-help-prevent-aviation-disasters/',
    'https://www.washingtonexaminer.com/opinion/beltway-confidential/3309352/partners-can-help-us-thwart-russia-iran-axis/',
    'https://www.washingtonexaminer.com/restoring-america/fairness-justice/3309410/violent-climate-action-not-free-speech/',
    'https://www.washingtonexaminer.com/opinion/3322085/trump-should-help-save-ss-united-states/',
    'https://www.washingtonexaminer.com/restoring-america/courage-strength-optimism/3321134/trump-latin-america-realignment-puts-america-first/',
    'https://www.washingtonexaminer.com/restoring-america/faith-freedom-self-reliance/3319748/democrats-backed-doge-until-trump-took-over/',
    'https://www.washingtonexaminer.com/op-eds/3321081/trump-should-call-new-elections-georgia/',
    'https://www.washingtonexaminer.com/restoring-america/fairness-justice/3315981/trump-should-follow-florida-virginia-models-criminal-justice-reform/',
    'https://www.washingtonexaminer.com/opinion/3313825/doge-root-american-values/',
    'https://www.washingtonexaminer.com/restoring-america/community-family/3312627/kelly-loeffler-is-the-champion-small-businesses-need/',
    'https://www.washingtonexaminer.com/restoring-america/community-family/3310424/halt-fentanyl-act-gives-americans-hope/',
    'https://www.washingtonexaminer.com/restoring-america/patriotism-unity/3309280/new-day-border-security-rule-of-law/',
    'https://www.washingtonexaminer.com/opinion/beltway-confidential/3321972/conception-begins-at-erection-absurd-theater-left/',
    'https://www.washingtonexaminer.com/restoring-america/community-family/3321936/problem-artificial-womb/',
    'https://www.washingtonexaminer.com/opinion/editorials/3321826/europe-cannot-handle-truth-free-speech/',
    "https://www.washingtonexaminer.com/opinion/columnists/3318785/complicated-story-iron-mountain/",
    "https://www.washingtonexaminer.com/opinion/beltway-confidential/3318406/disney-families-over-politics/",
    "https://www.washingtonexaminer.com/opinion/columnists/3318333/why-were-hopes-1990s-dashed/",
    "https://www.washingtonexaminer.com/in_focus/3318161/many-problems-trump-gaza-plan/",
    "https://www.washingtonexaminer.com/restoring-america/3317773/trump-transgender-order-win-for-religious-liberty/",
    "https://www.washingtonexaminer.com/opinion/beltway-confidential/3317606/journalists-discover-the-constitution/",
    "https://www.washingtonexaminer.com/restoring-america/faith-freedom-self-reliance/3317637/lori-chavez-deremers-pro-union-stance-makes-her-poor-choice-labor-secretary/",
    "https://www.washingtonexaminer.com/opinion/beltway-confidential/3317675/democrats-activist-judges-against-democracy/",
    "https://www.washingtonexaminer.com/restoring-america/patriotism-unity/3317626/fort-bragg-latest-name-change-pure-political-appeasement/",
    "https://www.washingtonexaminer.com/opinion/beltway-confidential/3317577/trump-must-stop-uks-dangerous-surrender-chagos-islands/",
    "https://www.washingtonexaminer.com/restoring-america/fairness-justice/3317053/shein-and-temu-must-be-restricted-over-slave-labor/",
    "https://www.washingtonexaminer.com/restoring-america/courage-strength-optimism/3316984/wed-texas-leads-in-securing-the-border/",
    "https://www.washingtonexaminer.com/opinion/editorials/3317408/trump-right-on-birthright-citizenship/",
    "https://www.washingtonexaminer.com/in_focus/3316988/super-bowl-rings-in-return-americana/",
    "https://www.washingtonexaminer.com/opinion/beltway-confidential/3317019/war-plastic-straws-highlights-democrats-shoddy-science/",
    "https://www.washingtonexaminer.com/opinion/columnists/3317377/beloved-pennsylvanian-returns-home-russian-prison/",
    "https://www.washingtonexaminer.com/restoring-america/fairness-justice/3317026/trump-global-golden-age-religious-liberty/",
    "https://www.washingtonexaminer.com/daily-memo/3317404/biden-economic-hangover/",
    "https://www.washingtonexaminer.com/opinion/columnists/3315442/constitutional-crisis-blame-democrats/",
    "https://www.washingtonexaminer.com/opinion/columnists/3317163/dumbest-immigration-policy-in-world/",
    "https://www.washingtonexaminer.com/opinion/beltway-confidential/3316191/trump-runs-defense-deep-state-mark-zaid-clearance-revocation/",
    "https://www.washingtonexaminer.com/opinion/3316603/jd-vance-trip-shows-confidence-tulsi-gabbard-rfk-jr-confirmation/",  
]

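# Check for duplicate URLs: the two counts should match.
# (Bare expressions in the middle of a cell are not displayed, so print them.)
print(len(urls), len(set(urls)))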

def scrape_washexam_article(url):
    """
    Scrapes an article from a given URL on washingtonexaminer.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Commentary", 
        "url": url
    }

    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        #print(soup)
        
        # Title
        title_meta = soup.find('meta', property='og:title')
        title = title_meta['content'] if title_meta else "Title not found"
        if " - Washington Examiner" in title:
            title = title.replace(" - Washington Examiner", "").strip()
        article_data["title"] = title
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Site name
        site_name_meta = soup.find('meta', property='og:site_name')
        site = site_name_meta['content'] if site_name_meta else "Site name not found"
        site = site.split(" - ")[0].strip()  # keep only the first part
        article_data["site"] = site
        
        # Published date (from the meta tag)
        pub_date = soup.find("meta", property="article:published_time")
        if pub_date:
            article_data["date"] = pub_date.get("content", "").strip()
        else:
            article_data["date"] = "Date not found"
        
        article_body = soup.find("div", class_="td-post-content")
        if article_body:
            # Remove all <figure> elements.
            for figure in article_body.find_all("figure"):
                figure.decompose()
            # Remove any <a> tag whose text starts with the unwanted phrase.
            for a in article_body.find_all("a"):
                a_text = a.get_text(strip=True)
                if re.match(r"^click\s+here\s+to\s+read\s+more\s+from", a_text, flags=re.IGNORECASE):
                    a.decompose()
            # Extract the text, using a space as separator.
            raw_text = article_body.get_text(separator=" ", strip=True)
            # Replace multiple whitespace/newlines with a single space.
            cleaned_text = re.sub(r'\s+', ' ', raw_text)
            article_data["text"] = cleaned_text
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

# Provide a common browser user agent - otherwise the scraping fails
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
    )
}
# Scrape articles and create a DataFrame
washexam_data_df = scrape_multiple_articles(urls, scrape_washexam_article)
# Store to CSV
washexam_data_df.to_csv("commentary_scraped_articles_washexam.csv", index=False)
# Print df 
washexam_data_df
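
The Daily Squib scraper keeps only the part of `article:published_time` before the `T`, whereas the scraper above stores the full value. If a uniform `YYYY-MM-DD` date column is wanted, one option is a small normaliser such as the sketch below. `to_iso_date` is a hypothetical helper and the sample timestamps are invented; values that do not start with an ISO date (for example the "Date not found" placeholder) are returned unchanged.

```python
import re

# Matches a leading YYYY-MM-DD, optionally followed by a time part.
ISO_DATE_PREFIX = re.compile(r"^(\d{4}-\d{2}-\d{2})(?:[T ].*)?$")


def to_iso_date(value: str) -> str:
    """Return the YYYY-MM-DD prefix of an ISO-like timestamp, else the input unchanged."""
    match = ISO_DATE_PREFIX.match(value.strip())
    return match.group(1) if match else value


# Invented values, for illustration only.
for raw in ["2025-02-10T14:30:00+00:00", "2025-02-10", "Date not found"]:
    print(repr(raw), "->", repr(to_iso_date(raw)))
```

Applied to a DataFrame column, this could be used as `washexam_data_df["date"].apply(to_iso_date)`, assuming the column holds strings.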

In [None]:
urls = ['https://www.nature.com/articles/d41586-025-00402-x',
        'https://www.nature.com/articles/d41586-025-00282-1',
        'https://www.nature.com/articles/d41586-025-00283-0',
        'https://www.nature.com/articles/d41586-025-00214-z',
        'https://www.nature.com/articles/d41586-025-00213-0',
        'https://www.nature.com/articles/d41586-025-00150-y',
        'https://www.nature.com/articles/d41586-025-00050-1',
        'https://www.nature.com/articles/d41586-025-00049-8',
        'https://www.nature.com/articles/d41586-025-00014-5',
        'https://www.nature.com/articles/d41586-025-00015-4',
        'https://www.nature.com/articles/d41586-024-04159-7',
        'https://www.nature.com/articles/d41586-024-04114-6',
        'https://www.nature.com/articles/d41586-024-04113-7',
        'https://www.nature.com/articles/d41586-024-04046-1',
        'https://www.nature.com/articles/d41586-024-03911-3',
        'https://www.nature.com/articles/d41586-024-03910-4',
        'https://www.nature.com/articles/d41586-024-03932-y',
        'https://www.nature.com/articles/d41586-024-03843-y',
        'https://www.nature.com/articles/d41586-024-03842-z',
        'https://www.nature.com/articles/d41586-024-03753-z',
        'https://www.nature.com/articles/d41586-024-03673-y',
        'https://www.nature.com/articles/d41586-024-03648-z',
        'https://www.nature.com/articles/d41586-024-03585-x',
        'https://www.nature.com/articles/d41586-024-03485-0',
        'https://www.nature.com/articles/d41586-024-03417-y',
        'https://www.nature.com/articles/d41586-024-03418-x',
        'https://www.nature.com/articles/d41586-024-03331-3',
        'https://www.nature.com/articles/d41586-024-03332-2',
        'https://www.nature.com/articles/d41586-024-03266-9',
        'https://www.nature.com/articles/d41586-024-03267-8',
        'https://www.nature.com/articles/d41586-024-03182-y',
        'https://www.nature.com/articles/d41586-024-03183-x',
        'https://www.nature.com/articles/d41586-024-03109-7',
        'https://www.nature.com/articles/d41586-024-03110-0',
        'https://www.nature.com/articles/d41586-024-02992-4',
        'https://www.nature.com/articles/d41586-024-02991-5',
        'https://www.nature.com/articles/d41586-024-02912-6',
        'https://www.nature.com/articles/d41586-024-02913-5',
        'https://www.nature.com/articles/d41586-024-02828-1',
        'https://www.nature.com/articles/d41586-024-02829-0',
        'https://www.nature.com/articles/d41586-024-02757-z',
        'https://www.nature.com/articles/d41586-024-02673-2',
        'https://www.nature.com/articles/d41586-024-02600-5',
        'https://www.nature.com/articles/d41586-024-02533-z',
        'https://www.nature.com/articles/d41586-024-02445-y',
        'https://www.nature.com/articles/d41586-024-02381-x',
        'https://www.nature.com/articles/d41586-024-02314-8',
        'https://www.nature.com/articles/d41586-024-02224-9',
        'https://www.nature.com/articles/d41586-024-02169-z',
        'https://www.nature.com/articles/d41586-024-02080-7'
       ]

def scrape_nat_article(url):
    """
    Scrapes an article from a given URL on nature.com and extracts relevant information.
    Extracts title, text, site, and published date.
    For site name, it falls back to the twitter:site meta tag if og:site_name is missing.
    For date, it attempts to extract from meta tag "article:published_time" first, then JSON-LD.
    For text, it looks for <div class="c-article-body">.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Commentary", 
        "url": url
    }
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": "https://www.google.com/"
    }
    
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        #print (soup)
        # Title: use og:title or fallback to the <title> tag.
        title_meta = soup.find('meta', property='og:title')
        title = title_meta['content'] if title_meta else (soup.title.string if soup.title else "Title not found")
        article_data["title"] = title
        
        # URL
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url
        
        # Site name: try og:site_name first; if missing, use twitter:site.
        site_name_meta = soup.find('meta', property='og:site_name')
        if site_name_meta:
            site = site_name_meta['content']
        else:
            twitter_site = soup.find('meta', attrs={'name': 'twitter:site'})
            if twitter_site:
                site = twitter_site['content']
                if site.startswith("@"):
                    site = site[1:]
            else:
                site = "Site name not found"
        article_data["site"] = site
        
        # Published date: first try meta tag; then fall back to JSON-LD.
        pub_date_meta = soup.find("meta", property="article:published_time")
        if pub_date_meta:
            article_data["date"] = pub_date_meta.get("content", "").strip()
        else:
            ld_script = soup.find("script", type="application/ld+json")
            if ld_script:
                try:
                    ld_json = json.loads(ld_script.string)
                    # Sometimes ld+json is a list of objects
                    if isinstance(ld_json, list):
                        ld_json = ld_json[0]
                    if "mainEntity" in ld_json and "datePublished" in ld_json["mainEntity"]:
                        article_data["date"] = ld_json["mainEntity"]["datePublished"]
                    elif "datePublished" in ld_json:
                        article_data["date"] = ld_json["datePublished"]
                    else:
                        article_data["date"] = "Date not found"
                except Exception:
                    article_data["date"] = "Date not found"
            else:
                article_data["date"] = "Date not found"
        
        # Category: extract from <li data-test="article-category">
        cat_li = soup.find('li', attrs={'data-test': 'article-category'})
        if cat_li:
            cat_span = cat_li.find('span', class_='c-article-identifiers__type')
            if cat_span:
                article_data["category"] = cat_span.get_text(strip=True)
            else:
                article_data["category"] = "Category not found"
        else:
            article_data["category"] = "Category not found"
        
        # Article body: look for the div with class "c-article-body"
        article_body = soup.find("div", class_=lambda c: c and "c-article-body" in c)
        if article_body:
            # Remove all <figure> elements.
            for figure in article_body.find_all("figure"):
                figure.decompose()
            
            # Also remove header title and teaser text from within article_body (if present).
            header_title_elem = article_body.find("h1", class_="c-article-magazine-title")
            if header_title_elem:
                header_title_elem.decompose()
            teaser_elem = article_body.find("div", class_="c-article-teaser-text")
            if teaser_elem:
                teaser_elem.decompose()
            
            # Now, gather the remaining paragraphs.
            paragraphs = article_body.find_all("p")
            # Also fetch header title and teaser from outside the article body as skip markers.
            header_title = ""
            teaser_text = ""
            ext_header = soup.find("h1", class_="c-article-magazine-title")
            if ext_header:
                header_title = ext_header.get_text(strip=True).lower()
            ext_teaser = soup.find("div", class_="c-article-teaser-text")
            if ext_teaser:
                teaser_text = ext_teaser.get_text(strip=True).lower()
            
            article_paragraphs = []
            for p in paragraphs:
                p_text = p.get_text(separator=" ", strip=True)
                lower_text = p_text.lower()
                # Skip paragraphs that contain the header title or teaser text
                if header_title and header_title in lower_text:
                    continue
                if teaser_text and teaser_text in lower_text:
                    continue
                article_paragraphs.append(p_text)
            
            # Join paragraphs and clean extra whitespace.
            article_text = " ".join(article_paragraphs)
            cleaned_text = re.sub(r'\s+', ' ', article_text).strip()
            article_data["text"] = cleaned_text
        else:
            article_data["text"] = "Article text not found"
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data

# Provide a common browser user agent - otherwise the scraping fails
headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": "https://www.google.com/"
}
# Scrape articles and create a DataFrame
nat_data_df = scrape_multiple_articles(urls, scrape_nat_article)
# Store to CSV
nat_data_df.to_csv("commentary_scraped_articles_nat.csv", index=False)
# Display the DataFrame
nat_data_df
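
The JSON-LD date lookup in the scraper above checks `mainEntity` first and then the top-level `datePublished`, and assumes the parsed block is a single dictionary. Some pages publish JSON-LD as a list of objects, in which case both checks fail and the date falls back to "Date not found". The following is a minimal sketch of a lookup that accepts both shapes. The helper name `find_date_published` and the sample JSON string are assumptions made for illustration and are not part of the notebook.

```python
import json


def find_date_published(ld_json):
    """Return the first datePublished value found, or "Date not found".

    Hypothetical helper (not part of the original notebook). Accepts a
    parsed JSON-LD value that is either a dict or a list of dicts.
    """
    # Normalise to a list so both shapes share one code path.
    candidates = ld_json if isinstance(ld_json, list) else [ld_json]
    for item in candidates:
        if not isinstance(item, dict):
            continue
        # Same precedence as the scraper: mainEntity first, then top level.
        main_entity = item.get("mainEntity")
        if isinstance(main_entity, dict) and "datePublished" in main_entity:
            return main_entity["datePublished"]
        if "datePublished" in item:
            return item["datePublished"]
    return "Date not found"


# Example with an inline JSON-LD string (made-up values for illustration).
raw = '[{"@type": "WebPage"}, {"@type": "Article", "datePublished": "2025-02-17"}]'
print(find_date_published(json.loads(raw)))
```

Returning the same "Date not found" placeholder keeps the output consistent with the other fields in `article_data`.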

In [None]:
urls = ["https://www.rollingstone.com/politics/politics-features/baltimore-sun-right-wing-takeover-david-smith-1235268329/"]

def scrape_stone_article(url):
    """
    Scrapes an article from a given URL on rollingstone.com and extracts relevant information.
    Extracts title, text, published date, and URL.
    For title, it reads the og:title meta tag.
    For date, it reads the meta tag "article:published_time".
    For text, it collects the paragraphs inside <div class="pmc-paywall">.
    On failure, it returns a dictionary with an "error" key instead.
    """
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/115.0.0.0 Safari/537.36"
        )
    }
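    try:
        # A timeout prevents the scraper from hanging on an unresponsive server.
        response = requests.get(url, headers=headers, timeout=30)
    except requests.RequestException as exc:
        return {"error": f"Request failed: {exc}", "url": url}
    if response.status_code != 200:
        return {"error": f"Failed to retrieve the page (status code {response.status_code})", "url": url}
    
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Remove advertisement blocks if present
    for ad in soup.find_all("div", class_="admz"):
        ad.decompose()

    # Locate the container that holds the article body.
    article_container = soup.find("div", class_="pmc-paywall")
    if not article_container:
        return {"error": "Article container not found", "url": url}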
    
    # Find all paragraphs that include the article text.
    paragraphs = article_container.find_all("p", class_=lambda x: x and "paragraph" in x)
    
    # Join paragraphs together, preserving some separation.
    article_text = "\n\n".join(p.get_text(separator=" ", strip=True) for p in paragraphs)
    
    # Extract the title and publication date from the page's meta tags
    title_tag = soup.find("meta", property="og:title")
    title = title_tag["content"] if title_tag and title_tag.has_attr("content") else "Title not found"
    
    published_tag = soup.find("meta", property="article:published_time")
    published_date = published_tag["content"] if published_tag and published_tag.has_attr("content") else "Date not found"
    
    # Return a dictionary with the article details
    return {
        "title": title,
        "text": article_text,
        "date": published_date,
        "url": url
    }


def scrape_multiple_stone_articles(urls):
    """
    Scrapes multiple articles from a list of URLs and stores the data in a DataFrame.

    Parameters
    ----------
    urls : list
        A list of article URLs to scrape.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the scraped data from all URLs. URLs that
        could not be scraped appear as rows with a value in the "error" column.
    """
    articles = []
    for url in urls:
        article = scrape_stone_article(url)
        articles.append(article)
    return pd.DataFrame(articles)

# Common browser user agent. Note: scrape_stone_article sets its own User-Agent
# header internally, so this dictionary is not used by the call below.
headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": "https://www.google.com/"
}
# Scrape articles and create a DataFrame
stone_data_df = scrape_multiple_stone_articles(urls)
# Store to CSV
stone_data_df.to_csv("commentary_scraped_articles_stone.csv", index=False)
# Display the DataFrame
stone_data_df
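
Because `scrape_stone_article` returns a dictionary with an "error" key when a page cannot be fetched or parsed, failed scrapes end up as rows in the same DataFrame and CSV as the articles. The following is a minimal sketch of separating the two groups before saving. The helper name `split_scrape_results` and the sample records are assumptions made for illustration and are not part of the notebook.

```python
def split_scrape_results(results):
    """Split scraper output into (articles, failures).

    Hypothetical helper: treats any dict that has an "error" key as a failure,
    matching the error dictionaries returned by scrape_stone_article.
    """
    articles = [r for r in results if "error" not in r]
    failures = [r for r in results if "error" in r]
    return articles, failures


# Made-up results for illustration; real ones come from the scraper.
sample = [
    {"title": "A", "text": "Body", "date": "2025-01-01", "url": "https://example.com/a"},
    {"error": "Article container not found"},
]
articles, failures = split_scrape_results(sample)
print(len(articles), len(failures))
```

The `articles` list could then be passed to `pd.DataFrame` and written to CSV, with `failures` kept aside for a manual retry.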

In [None]:
urls = ['https://www.washingtonpost.com/opinions/2025/02/17/trump-education-department-linda-mcmahon/?itid=sf_opinions_top-table-all-collex_p001_f002']