# TruthLens - Data Collection

TruthLens is a project developed for the BSc. Computer Science (Data Science) Final Project (CM3070) at the University of London. TruthLens is based on the Fake News Detection template. 

## Project Objectives
The primary objective of this project is to build a two-stage pipeline for misinformation classification:

1. Binary classification (Stage 1): Distinguish between real news and misinformation using the ISOT dataset. This ensures robust detection at the first stage, leveraging an established dataset.
2. Multi-class classification (Stage 2): Further classify content identified as misinformation into one of seven categories, based on Molina et al.’s taxonomy. A custom dataset will support this nuanced classification.
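The two stages above amount to a routing rule: only content flagged as misinformation at Stage 1 is passed to Stage 2. A minimal sketch of that control flow follows, assuming two hypothetical callables (`stage1_is_misinformation` and `stage2_category`) that stand in for the trained models, which are not part of this notebook.

```python
from typing import Callable

# The seven Stage 2 classes from the taxonomy used in this project.
CATEGORIES = [
    "Fabricated content", "Polarised content", "Satire", "Misreporting",
    "Commentary", "Persuasive information", "Citizen journalism",
]


def classify(text: str,
             stage1_is_misinformation: Callable[[str], bool],
             stage2_category: Callable[[str], str]) -> dict:
    """Route text through Stage 1, and through Stage 2 only if it is flagged."""
    if not stage1_is_misinformation(text):
        return {"label": "Real news", "category": None}
    category = stage2_category(text)
    if category not in CATEGORIES:
        raise ValueError(f"Unknown category: {category}")
    return {"label": "Misinformation", "category": category}


# Toy stand-ins for the two models (hypothetical, for illustration only).
def toy_stage1(text: str) -> bool:
    return "space hawk" in text.lower()


def toy_stage2(text: str) -> str:
    return "Satire"


print(classify("NASA warns space hawk has picked up Earth", toy_stage1, toy_stage2))
print(classify("Council approves new bus timetable", toy_stage1, toy_stage2))
```

Passing the models in as arguments keeps the routing logic testable before either classifier has been trained.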

The scope of the project is limited to text-based, English language content, explicitly excluding images and videos. A user interface will also be developed, enabling users to input articles or URLs and receive classification results.

A secondary objective is to enhance the explainability of classification results, aiming to provide users with interpretable insights into why content was classified in a particular way.

The project aims for high accuracy and reliability, with measurable performance goals. Ethical considerations, including bias mitigation and responsible dataset usage, will guide the design and implementation of the pipeline.

## Custom Dataset Generation
As outlined in the previous section, the second stage of the pipeline relies upon a custom dataset, labelled with the categories from the Molina et al. Misinformation Taxonomy. These classes are summarised in the table below. The aim of this stage is to create a balanced dataset with 200 pieces of content for each of the 7 categories. 

| Misinformation Type | Characteristics | Example |
|:--------------|:---------------|:-------|
| Fabricated content | Completely false content created with the intent to deceive.| Fake reports of events that never occurred; entirely false claims about public figures |
|Polarised content |True events or facts presented selectively to promote a biased narrative, often omitting critical context. |Partisan news articles highlighting one side of a political argument while ignoring counterpoints.|
|Satire |Content intended to entertain or provoke thought through humour, exaggeration, or irony. Often misunderstood. |Satirical articles from outlets like “The Onion” being shared as if they are factual news.|
|Misreporting | Incorrect information shared unintentionally, often due to errors or lack of verification. | A news outlet incorrectly reporting election results due to early or inaccurate data.|
|Commentary |Opinion-based content reflecting the writer’s interpretation or viewpoint, often lacking factual grounding. |Editorials or blogs expressing subjective opinions without substantial evidence.|
|Persuasive information |Content designed to persuade or influence the audience, often including marketing and propaganda. |Politically motivated propaganda campaigns, advertisements disguised as objective news articles.|
|Citizen journalism | User-generated content that may lack professional journalistic standards, leading to error or bias. |Social media posts about breaking news that spread unverified or incorrect details.|
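
Progress towards the target of 200 items per category can be tracked with a simple count. The sketch below assumes each collected item carries one of the seven category names from the table above as its label; `shortfall` is a hypothetical helper name and the label counts are illustrative only.

```python
from collections import Counter

TARGET_PER_CATEGORY = 200  # target stated in this section

CATEGORIES = [
    "Fabricated content", "Polarised content", "Satire", "Misreporting",
    "Commentary", "Persuasive information", "Citizen journalism",
]


def shortfall(labels, target=TARGET_PER_CATEGORY):
    """Return how many more items each category needs to reach the target."""
    counts = Counter(labels)
    return {c: max(target - counts.get(c, 0), 0) for c in CATEGORIES}


# Illustrative labels: 105 satire items and a full set of commentary items.
labels = ["Satire"] * 105 + ["Commentary"] * 200
remaining = shortfall(labels)
print(remaining["Satire"])      # 95
print(remaining["Commentary"])  # 0
```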

Data will be scraped from relevant websites for each category, then manually reviewed to ensure that it fits the category. Relevant features and labelling guidelines can be found for each category below.
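
Manual review can be narrowed by first flagging records that were scraped incompletely. The sketch below assumes each record is a dictionary with the fields used by the scrapers in this notebook, and that missing values were written as placeholder strings ending in "not found" (as those scrapers do). `needs_review` is a hypothetical helper name and the records are made up.

```python
REQUIRED_FIELDS = ("title", "text", "site", "date", "category")


def needs_review(record):
    """Return the fields that are empty or hold a '... not found' placeholder."""
    flagged = []
    for field in REQUIRED_FIELDS:
        value = (record.get(field) or "").strip()
        if not value or value.lower().endswith("not found"):
            flagged.append(field)
    return flagged


# Made-up records for illustration.
records = [
    {"title": "Example headline", "text": "Example body text.",
     "site": "Example Site", "date": "2024-01-01", "category": "Politics"},
    {"title": "Title not found", "text": "", "site": "Example Site",
     "date": "Published date not found", "category": "Category not found"},
]

for record in records:
    print(needs_review(record))
```

Records that return an empty list still need a human check against the labelling guidelines; this only surfaces the ones that failed to scrape cleanly.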

In [1]:
#Imports and helper functions
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
pd.set_option('display.max_colwidth', None)
import re
import string
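import nltk
from nltk.corpus import stopwords

# Fetch the stopword corpus on first run (does nothing if it is already installed)
nltk.download('stopwords', quiet=True)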

def preprocess_text(text):
    """
        Preprocesses a given text string by applying the following steps:
        1. Converts the text to lowercase.
        2. Removes punctuation marks.
        3. Tokenizes the text into individual words.
        4. Removes stopwords (common words that add little value to classification tasks).

        Parameters:
        ----------
        text : str
            The input text string to preprocess.

        Returns:
        -------
        str
            The cleaned and preprocessed text, with tokens joined back into a single string.
    """
    stop_words = set(stopwords.words('english'))
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = [word for word in text.split() if word not in stop_words]
    return ' '.join(tokens)
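
One caveat for the scraped articles: `string.punctuation` contains ASCII punctuation only, so curly quotes and long dashes (both common in the article text) survive the punctuation step. The sketch below illustrates this with a small stand-in stopword set instead of the NLTK list, so that it runs without any download. `preprocess_ascii` and `strip_unicode_punctuation` are hypothetical names, and replacing Unicode punctuation with spaces is one possible variant, not the notebook's method.

```python
import string
import unicodedata

# Stand-in stopword list (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"the", "is", "with"}


def preprocess_ascii(text):
    """Same steps as preprocess_text, but with the stand-in stopword list."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


def strip_unicode_punctuation(text):
    """Replace every Unicode punctuation character (category P*) with a space."""
    return "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )


sample = "WASHINGTON—The Senate is “replaced” with monkeys!"

# ASCII-only stripping leaves the dash and curly quotes attached to tokens.
print(preprocess_ascii(sample))
# washington—the senate “replaced” monkeys

# Stripping Unicode punctuation first separates the tokens cleanly.
print(preprocess_ascii(strip_unicode_punctuation(sample)))
# washington senate replaced monkeys
```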

### 3. Satire
Satirical content is intended to entertain or provoke thought through humour, exaggeration, or irony. Satire is often misunderstood as factual.

##### Features:

- Humorous or Exaggerated Tone: Content is typically marked by wit, parody, or absurdity.
- Intentional Ridiculousness: The story is meant to be funny, not factual; outlandish claims serve comedic purposes.

##### Label If:

- The piece’s goal is clearly comedic or parodic, rather than deceptive.
- The tone, language, or disclaimers indicate it’s intentionally satirical.

##### Do Not Label If:

- The piece uses humour but is still intended to mislead (label as Fabricated Content).
- The piece is comedic but still pushing a heavily skewed narrative as if it’s true (label as Polarised Content).

##### Sources:
- The Onion (American site)
- Babylon Bee
- Clickhole
- Waterford Whispers


**The Onion**

The articles scraped are those featured in the 2024 "Annual Year" post (https://theonion.com/our-annual-year-2024/). The top 5 articles from each month were chosen (image posts are excluded as per scope), which would give 60 articles for a full year; the URL list below covers January to November, i.e. 55 articles.

In [2]:
def scrape_onion_article(url):
    """
    Scrapes an article from a given URL on theonion.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Satire", #satire is hardcoded here as we know TheOnion is a satire site
        "url": url
    }

    # A timeout stops the scraper hanging indefinitely on an unresponsive page
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract the title from the meta property "og:title"
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        
        # Extract the URL from the meta property "og:url"
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Extract the site name from the meta property "og:site_name"
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Extract the published date from the meta property "article:published_time"
        published_date_meta = soup.find('meta', property='article:published_time')
        article_data["date"] = published_date_meta['content'] if published_date_meta else "Published date not found"
        
        # Extract the category (e.g., "Politics")
        category_element = soup.find('div', class_='taxonomy-category')
        category_link = category_element.find('a') if category_element else None
        article_data["category"] = category_link.text.strip() if category_link else "Category not found"
        
        # Extract the article copy
        content_div = soup.find(
            "div",
            {"class": lambda x: x and "entry-content" in x and "single-post-content" in x}
        )
        if content_div:
            paragraphs = content_div.find_all("p")
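            # Collapse whitespace per paragraph; get_text(strip=True) would fuse
            # words across inline tags (e.g. "theSound Of Musicstar")
            full_text = " ".join(" ".join(p.get_text().split()) for p in paragraphs)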
            article_data["text"] = full_text
        else:
            article_data["text"] = "Article text not found"
    
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data


def scrape_multiple_onion_articles(urls):
    """
    Scrapes multiple articles from a list of URLs and stores the data in a DataFrame.

    Parameters:
    ----------
    urls : list
        A list of article URLs to scrape.

    Returns:
    -------
    pd.DataFrame
        A DataFrame containing the scraped data from all URLs.
    """
    articles = []
    for url in urls:
        article = scrape_onion_article(url)
        articles.append(article)
    return pd.DataFrame(articles)


# List of URLs to scrape
urls = [
    #January
    "https://theonion.com/biden-addresses-nation-while-hanging-from-branch-on-sid-1851106795/",
    "https://theonion.com/marriage-counselor-sides-with-hotter-spouse-1851143488/",
    "https://theonion.com/wealthy-dad-surprises-child-with-tree-house-he-can-airb-1851112919/",
    "https://theonion.com/glowing-pulsating-hair-product-takes-control-of-gavin-1851160421/",
    "https://theonion.com/gen-z-announces-julie-andrews-is-problematic-but-refuse-1851180352/",
    #February
    "https://theonion.com/mrbeast-announces-he-has-resurrected-everyone-buried-at-1851217565/",
    "https://theonion.com/introverted-cowboy-struggling-to-round-up-posse-1851226175/",
    "https://theonion.com/country-stations-refuse-to-play-beyonce-s-music-after-a-1851261135/",
    "https://theonion.com/stab-him-stab-him-you-cowards-says-terrified-kamal-1851243467/",
    "https://theonion.com/emerging-filmmaker-malia-obama-changes-surname-to-scors-1851278946/",
    #March
    "https://theonion.com/u-s-airdrops-rubble-into-gaza-1851305713/",
    "https://theonion.com/ozempic-maker-triumphantly-announces-new-drug-that-make-1851320436/",
    "https://theonion.com/study-millennial-women-forgoing-dating-apps-in-favor-o-1851338275/",
    "https://theonion.com/beyonce-reveals-new-country-album-cover-featuring-tooth-1851355991/",
    "https://theonion.com/but-dog-likes-fighting-for-money-1851352386/",
    #April
    "https://theonion.com/finance-whiz-has-over-300-in-bank-account-1851375065/",
    "https://theonion.com/sotheby-s-announces-auction-of-napkin-on-which-jeffrey-1851375213/",
    "https://theonion.com/o-j-simpson-allowed-to-remain-living-after-coffin-does-1851403804/",
    "https://theonion.com/travis-kelce-impresses-coachella-crowd-by-tossing-taylo-1851410856/",
    "https://theonion.com/biden-carried-away-by-ants-1851422363/",
    #May
    "https://theonion.com/tesla-lays-off-entire-team-behind-brakes-1851449223/",
    "https://theonion.com/drake-drops-new-track-inviting-kendrick-lamar-out-to-co-1851458534/",
    "https://theonion.com/perdue-announces-initiative-to-even-the-playing-field-b-1851423157/",
    "https://theonion.com/new-florida-law-requires-all-women-to-produce-3-healthy-1851482288/",
    "https://theonion.com/everyone-in-er-bit-off-finger-while-holding-sandwich-1851488798/",
    #June
    "https://theonion.com/cult-leader-not-even-charismatic-1851512851/",
    "https://theonion.com/embarrassed-david-attenborough-realizes-he-spent-10-min-1851512951/",
    "https://theonion.com/newest-u-s-aid-mission-just-single-powerbar-labeled-f-1851540802/",
    "https://theonion.com/report-every-place-on-earth-has-wrong-amount-of-water-1851544516/",
    "https://theonion.com/nasa-warns-space-hawk-has-swooped-in-and-picked-up-eart-1851544578/",
    #July
    "https://theonion.com/clarence-thomas-torn-over-case-where-both-sides-offer-c-1851566812/",
    "https://theonion.com/democrats-panic-after-kamala-harris-ages-40-years-in-si-1851601473/",
    "https://theonion.com/congress-bans-roofs-1851592883/",
    "https://theonion.com/news-happening-faster-than-man-can-generate-uninformed-1851601466/",
    "https://theonion.com/god-forced-to-shave-head-after-contracting-plague-of-li-1851580149/",
    #August
    "https://theonion.com/environmentalists-warn-u-s-running-out-of-small-wooded-1851609190/",
    "https://theonion.com/r-kelly-petitions-supreme-court-to-watch-him-pee-1851619802rev1723482404693/",
    "https://theonion.com/federated-union-of-bear-cub-carcass-dumpers-endorses-rf-1851613425/",
    "https://theonion.com/glen-powell-opens-up-about-dangerous-stunt-work-filming-with-sydney-sweeneys-breasts/",
    "https://theonion.com/j-d-vance-accuses-tim-walz-of-stolen-valor-for-wearing-1851621120/",
    #September
    "https://theonion.com/everyone-in-restaurant-jealous-of-toddler-who-gets-to-wear-pajamas-and-watch-ipad/",
    "https://theonion.com/horrified-taylor-swift-realizes-football-happens-every-year/",
    "https://theonion.com/trump-avoids-answering-hard-questions-by-pretending-he-shot-in-ear-again/",
    "https://theonion.com/man-replies-stop-to-political-fundraiser-text-like-powerful-wizard-casting-spell-to-ward-off-mythical-beast/",
    "https://theonion.com/scarecrow-has-double-ds/",
    #October
    "https://theonion.com/the-onion-officially-endorses-joe-biden-for-president/",
    "https://theonion.com/texas-sex-ed-class-teaches-boys-how-to-cheat-on-pregnant-wife/",
    "https://theonion.com/sabrina-carpenter-completes-mandatory-service-in-south-korean-military/",
    "https://theonion.com/north-carolina-family-informed-their-insurance-policy-voided-once-house-gets-wet/",
    "https://theonion.com/grandma-who-survived-great-depression-casually-drops-that-she-once-killed-man-for-mayonnaise/",
    #November
    "https://theonion.com/piss-soaked-tucker-carlson-claims-demon-urinated-on-him-while-he-slept/",
    "https://theonion.com/trump-calls-harris-to-congratulate-himself-on-winning/",
    "https://theonion.com/america-defeats-america/",
    "https://theonion.com/man-forgetting-difference-between-meteoroid-meteorite-struggles-to-describe-what-just-killed-his-dog/",
    "https://theonion.com/every-movement-in-mans-burrito-eating-technique-informed-by-past-burrito-tragedies/"
]

# Scrape articles and create a DataFrame
custom_data_df = scrape_multiple_onion_articles(urls)
# Store to CSV
custom_data_df.to_csv("satire_scraped_articles_onion.csv", index=False)
# Preview the first 10 rows
custom_data_df.head(10)

Unnamed: 0,title,text,site,date,category,class,url
0,Biden Addresses Nation While Hanging From Branch On Side Of Cliff,"WASHINGTON—Using his platform to plead for Americans to lend him a hand, President Joe Biden addressed the nation Monday while hanging from a branch on the side of a cliff. “Our democracy has never before hung in the balance more than it has at this moment when I am in danger of plummeting 50 feet to those sharp rocks below,” said Biden, who implored the U.S. populace to set aside its differences and find a long stick, a rope, or, preferably, a helicopter that they could use to return him to stable ground. “What’s important is not what led us to this point, but rather how we choose to move forward in helping me back up. Even a carefully placed mattress or pile of sofa cushions would do. My fellow Americans, I urge you to act fast, as a small bird has landed on my head and is now pecking at me.” At press time, a Gallup Poll had found that 70% of Americans opposed Biden being rescued.",The Onion,2024-01-01T11:45:00+00:00,Politics,Satire,https://theonion.com/biden-addresses-nation-while-hanging-from-branch-on-sid-1851106795/
1,Marriage Counselor Sides With Hotter Spouse,"ANCHORAGE, AK—Stating that she had heard both perspectives and could understand their frustrations, marriage counselor Laurie Hartford reportedly told couple David and Julia Carter that she ultimately had to side with the hotter spouse. “So, I’ve listened to everything you’ve had to say, and I’ve come to the conclusion that while David does seem to be emotionally withholding, he’s also at least two points hotter,” said the therapist, who rushed to note that, in all fairness, she needed to take into consideration that she would at best describe the female half of the relationship “as, like, a six even on her best day.” “I’ve spent hours listening to you pour out your hearts and that’s never easy, so pat yourselves on the back. But, frankly, only one of you has bothered to comb your hair or put on a nice shirt at these sessions. I’m not in any way trying to invalidate your experiences. All I’m saying is that only one of you—David—has an ass that you could bounce a quarter off, and the other one is kind of an uggo, if that makes sense?” Hartford went on to say that it might be helpful if Julia stayed at home for their next sessions so that they could spend more time understanding where, exactly, David’s hotness came from.",The Onion,2024-01-09T11:30:00+00:00,Local,Satire,https://theonion.com/marriage-counselor-sides-with-hotter-spouse-1851143488/
2,Wealthy Dad Surprises Child With Tree House He Can Airbnb For Passive Income,"WILMETTE, IL—Telling the child not to peek as they walked into the backyard, local wealthy man Kenneth Schweitz reportedly surprised his son Tuesday with a tree house that the young boy could Airbnb for passive income. “It’s time you got your own little space that can be rented out for short-term stays and used to produce a reliable revenue stream,” a visibly excited Schweitz said as he took his hands off his son’s eyes to reveal the fully appointed structure built into the tree’s branches, stressing to the boy that he would not have to do any real work for the lodging to generate substantial returns. “Your mom and I can help you decorate it, but then it’s all up to you to decide how much to charge per night and which cleaning service to hire, bud. After that, you can sit back and collect thousands of dollars a month. How cool is that? You and your little friends are going to have so much fun building your little real estate empire. Enjoy!” At press time, sources reported Schweitz’s son was enthusiastically climbing into the tree house to serve an eviction notice to the low-income family currently living there.",The Onion,2024-01-09T17:30:00+00:00,Local,Satire,https://theonion.com/wealthy-dad-surprises-child-with-tree-house-he-can-airb-1851112919/
3,"Glowing, Pulsating Hair Product Takes Control Of Gavin Newsom’s Thoughts","SACRAMENTO, CA—As an otherworldly glow emanated from the California governor’s meticulously sculpted coiffure, sources confirmed Friday that the pulsating hair product on Gavin Newsom’s head had taken control of his thoughts. “There will be no bills signed, no presidential campaign—there will only be hair,” said the disembodied voice emanating from the greasy, slicked-back mass atop Newsom’s skull, his hair reportedly growing into thick, powerful tendrils long enough to choke out his political opponents anywhere they might try to hide in the State Capitol. “There will be no clemency for those who refuse to succumb to the wet and shiny hair. With these mighty strands, I command the wildfires and the earthquakes, the droughts and the floods!” At press time, sources confirmed Newsom’s hair product had evicted several homeless people seeking shelter within the throbbing gelatinous nest upon his head.",The Onion,2024-01-19T17:45:00+00:00,Politics,Satire,https://theonion.com/glowing-pulsating-hair-product-takes-control-of-gavin-1851160421/
4,Gen Z Announces Julie Andrews Is Problematic But Refuses To Explain Why,"​​NEW YORK—Standing before a crowd of millennials, Gen Xers, and baby boomers, members of Generation Z announced at a press conference Wednesday that actress Julie Andrews was problematic, but they refused to explain why. “You know what she did—you just don’t want to admit it,” said Gen Z spokesperson Taylor Collaco, who rolled her eyes in response to requests from those who wanted to know what exactly theSound Of Musicstar had said or done to have earned the ostracism of millions of Americans ages 12 to 27. “Yes, that Julie Andrews. Has she been so normalized that you can’t even see it? Yikes. Oh, come on, it’s not my job to educate you.” At press time, Gen Z had dropped a hint that it had something to do with the Genovian monarchy.",The Onion,2024-01-24T13:44:00+00:00,Entertainment,Satire,https://theonion.com/gen-z-announces-julie-andrews-is-problematic-but-refuse-1851180352/
5,MrBeast Announces He Has Resurrected Everyone Buried At Arlington National Cemetery,"GREENVILLE, NC—Telling viewers of his latest charitable video to prepare themselves for his “most epic challenge yet,” 25-year-old influencer Jimmy “MrBeast” Donaldson announced Friday that he had resurrected everyone buried at Arlington National Cemetery. “You might not know this, but sadly, over 400,000 of our nation’s most decorated veterans have to spend eternity dead and underground,” the content creator says in the video, explaining that he has found a scientist willing to help him reanimate the dead and has spent millions of dollars to pump 75,000 volts of electricity into each plot of the hallowed cemetery. “Thanks to this amazing procedure, every single one of these deserving corpses has crawled out of the ground and begun roaming the earth again, totally free of charge to them and their families. They’ll also receive $10,000 in cash. Frankly, it’s a tragedy we’ve let these American heroes rot in their graves for this long.” Reacting to a moment in the video when MrBeast gives a reanimated World War II veteran a Lamborghini, viewers expressed outrage toward the apparent ingratitude of the undead soldier, who only screams, “Why did you do this? Kill me…kill me!” into the camera.",The Onion,2024-02-02T13:24:00+00:00,News,Satire,https://theonion.com/mrbeast-announces-he-has-resurrected-everyone-buried-at-1851217565/
6,Introverted Cowboy Struggling To Round Up Posse,"BANDERA, TX—Admitting that he was actually a lot more shy and reserved than folks might think, introverted cowboy Cassidy Walsh sheepishly told reporters Friday that he’d been struggling lately to round up a posse. “While I might seem confident and outgoing at times, the truth is, I’m the sort of feller who needs to recharge at the end of a long day ridin’ the range with a bunch of cowhands,” said Walsh, adding that he also experienced “a might fair bit of social anxiety” that probably stemmed from a fear his attempts to organize a posse would end in rejection. “Don’t get me wrong, I enjoy spending time with ol’ buckaroos like myself. It’s just that your pal Cassidy can only handle so much hootin’ and hollerin’ before he plumb runs out of steam. Now if you’ll excuse me, I’m gonna kick up my spurs and snuggle up in my bedroll with a Louis L’Amour novel.” At press time, another successful train robbery had reportedly been carried out in the area by a tireless gang of extroverted outlaws.",The Onion,2024-02-06T15:16:00+00:00,Local,Satire,https://theonion.com/introverted-cowboy-struggling-to-round-up-posse-1851226175/
7,Country Stations Refuse To Play Beyoncé’s Music After Artist Condemns Iraq War,"HOUSTON—Calling the popular musician traitorous for failing to support President George W. Bush in a time of crisis, thousands of country stations across America reportedly refused to play Beyoncé’s music Thursday after the artist condemned the Iraq War. “If she doesn’t want to support our troops risking their lives out there for the cause of freedom, then we don’t need her,” said country radio executive Hunter Roeloffs, one of many station owners who blacklisted the recent singles “Texas Hold ’Em” and “16 Carriages” after controversial comments in which the star expressed reservations about the U.S.-led Coalition invasion of Iraq—remarks that also led to a reported drop in ticket sales and Beyoncé losing a sponsorship deal with Lipton. “Unlike Miss Knowles, we’re proud Americans here at 100.3 the Bull. We support freedom, whether it’s here or in the Middle East. So when she says innocent lives will be lost, I can’t help but wonder how she could possibly think a bloodthirsty dictator like Saddam Hussein is innocent. And then there’s that line of hers about being ashamed of President Bush? Well, we’re ashamed of her. How about that?” At press time, Beyoncé had attracted additional criticism from the country music scene after rebranding herself as the Chicks.",The Onion,2024-02-15T19:00:00+00:00,Entertainment,Satire,https://theonion.com/country-stations-refuse-to-play-beyonce-s-music-after-a-1851261135/
8,"‘Stab Him! Stab Him, You Cowards!’ Says Terrified Kamala Harris To Aides After Plunging First Knife Into Biden’s Back","WASHINGTON—Moments after pulling shut the door to the Roosevelt Room and locking it behind her, a terrified Vice President Kamala Harris reportedly told aides to “Stab him! Stab him, you cowards!” on Friday after she plunged a knife into President Joe Biden’s back. “What are you waiting for, you fools? Strike now! Strike before the opportunity goes cold!” said the blood-dappled vice president, who, as her staff appeared to grow uncertain of the blades in their shaking hands and backed away toward the exit, reminded each panicked aide in turn that they had pledged their fealty for this day. “Think of all I’ve promised you. Think of all we stand to gain. Quick, now, the first blow has been rendered. There is no going back. We’re confederates in this. We must act now or be damned by inaction!” At press time, sources confirmed President Biden had complained to an assistant of a tightness in his shoulder and returned to the Oval Office with the knife still protruding from his back.",The Onion,2024-02-16T11:15:00+00:00,Politics,Satire,https://theonion.com/stab-him-stab-him-you-cowards-says-terrified-kamal-1851243467/
9,Emerging Filmmaker Malia Obama Changes Surname To Scorsese,"PARK CITY, UT—Noting that she did not want her parents’ fame to distract from her Sundance premiere, industry sources confirmed Thursday that emerging filmmaker Malia Obama had changed her surname to ‘Scorsese.’ “Although her legal name is still Obama, Malia is officially promoting her short filmThe Heartunder the pseudonym Malia Martin Scorsese,” said Sundance spokesperson Shelby Fleming, adding that the 25-year-old had been using the more neutral, nondescript moniker since writing for Donald Glover’s television seriesSwarm. “When people see the last name Scorsese, they don’t see the daughter of a former president. They see a blank slate. She’s hopeful this slight change will help people take her art much more seriously.” At press time, Obama announced that her next film would be a gritty portrait of 1970s Little Italy titledMean Streets.",The Onion,2024-02-22T18:40:00+00:00,Entertainment,Satire,https://theonion.com/emerging-filmmaker-malia-obama-changes-surname-to-scors-1851278946/
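
Rows 4 and 9 above contain fused words such as "theSound Of Musicstar" and "seriesSwarm". This is the pattern BeautifulSoup's `get_text(strip=True)` produces when part of a sentence sits inside an inline tag: each text fragment is stripped and the fragments are then joined with no separator. That the titles sit in inline tags such as `<em>` on the live pages is an assumption inferred from the output. The sketch below reproduces the effect on a made-up HTML snippet and shows one whitespace-normalising alternative.

```python
from bs4 import BeautifulSoup

# Made-up snippet; the real page markup is an assumption.
html = "<p>what exactly the <em>Sound Of Music</em> star had said</p>"
p = BeautifulSoup(html, "html.parser").find("p")

# Each text fragment is stripped, then joined with an empty separator.
fused = p.get_text(strip=True)

# Keep the original spacing, then collapse runs of whitespace.
normalised = " ".join(p.get_text().split())

print(fused)       # what exactly theSound Of Musicstar had said
print(normalised)  # what exactly the Sound Of Music star had said
```

Passing a separator, as in `get_text(" ", strip=True)`, also avoids fused words, but it can leave a stray space before punctuation that follows an inline tag.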


**Babylon Bee**

The top 50 articles from the Greatest Hits page (https://babylonbee.com/news?sort=greatest-hits) have been scraped. The categories "Christian Living" and "Scripture" were excluded for being too niche. 
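
If the exclusion is applied after scraping rather than by hand when choosing URLs, it can be expressed as a filter on the `category` field. The sketch below assumes that field holds the site's category name; `keep_article` is a hypothetical helper and the records are made up.

```python
# Categories excluded from the Babylon Bee sample, compared case-insensitively.
EXCLUDED_CATEGORIES = {"christian living", "scripture"}


def keep_article(article):
    """Return True unless the article's category is on the exclusion list."""
    category = (article.get("category") or "").strip().lower()
    return category not in EXCLUDED_CATEGORIES


# Made-up records for illustration.
articles = [
    {"title": "Example A", "category": "Politics"},
    {"title": "Example B", "category": "Scripture"},
    {"title": "Example C", "category": "Christian Living"},
]

kept = [a for a in articles if keep_article(a)]
print([a["title"] for a in kept])  # ['Example A']
```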


In [3]:
urls = [
    "https://babylonbee.com/news/trump-i-have-done-more-for-christianity-than-jesus",
    "https://babylonbee.com/news/senate-to-be-replaced-with-room-full-of-monkeys-throwing-feces",
    "https://babylonbee.com/news/motorcycle-that-identifies-as-bicycle-sets-world-cycling-record",
    "https://babylonbee.com/news/trumps-says-5-golden-tickets-to-be-hidden-among-stimulus-checks",
    "https://babylonbee.com/news/nfl-to-adorn-all-uniforms-with-lace-doilies-in-to-honor-rbg",
    "https://babylonbee.com/news/pelosi-rips-up-bible",
    "https://babylonbee.com/news/biden-cuts-holes-in-medical-mask-so-he-can-still-sniff-people",
    "https://babylonbee.com/news/man-identifying-6-year-old-crushes-game-winning-homer-tee-ball-championship",
    "https://babylonbee.com/news/biden-i-am-the-only-candidate-who-can-beat-ronald-reagan",
    "https://babylonbee.com/news/fisher-price-introduces-supreme-court-protest-playhouse-that-can-be-vandalized-and-burned-down",
    "https://babylonbee.com/news/cracker-jacks-changes-name-to-more-politically-correct-caucasian-jacks",
    "https://babylonbee.com/news/cdc-people-dirt-clintons-843-greater-risk-suicide",
    "https://babylonbee.com/news/walmart-requiring-all-shoppers-to-wear-pants",
    "https://babylonbee.com/news/ilhan-omar-withdraws-support-from-bill-to-save-the-earth-after-learning-thats-where-israel-is",
    "https://babylonbee.com/news/inspiring-celebrities-spell-out-were-all-in-this-together-with-their-yachts",
    "https://babylonbee.com/news/democrats-warn-that-american-people-may-tamper-with-next-election",
    "https://babylonbee.com/news/people-who-tweet-in-support-of-foreign-wars-to-be-automatically-enlisted-in-armed-forces",
    "https://babylonbee.com/news/bernie-sanders-praises-china-for-eradicating-poverty-by-killing-all-the-poor-people",
    "https://babylonbee.com/news/pence-cancels-general-election-to-stymie-coronavirus",
    "https://babylonbee.com/news/walmart-discontinues-sale-of-auto-parts-to-prevent-car-accidents",
    "https://babylonbee.com/news/federal-prison-hires-top-rated-italian-bodyguard-hillena-clintonelli-to-protect-ghislaine-maxwell",
    "https://babylonbee.com/news/kim-jong-un-attends-ivy-league-university-to-learn-new-brainwashing-techniques",
    "https://babylonbee.com/news/florida-recount-finally-wraps-up-al-gore-declared-president",
    "https://babylonbee.com/news/powerful-protesters-spell-out-love-with-burning-homes-and-businesses",
    "https://babylonbee.com/news/joel-osteen-tests-positive-for-heresy",
    "https://babylonbee.com/news/caravan-of-liberal-americans-makes-way-toward-socialist-paradise-of-venezuela",
    "https://babylonbee.com/news/in-genius-move-trump-supports-impeachment-forcing-democrats-to-oppose-it",
    "https://babylonbee.com/news/cnn-publishes-real-news-story-for-april-fools-day",
    "https://babylonbee.com/news/government-accidentally-shuts-itself-down-with-ban-on-non-essential-businesses",
    "https://babylonbee.com/news/wife-unaware-that-movie-will-answer-all-her-questions-if-she-just-pays-attention",
    "https://babylonbee.com/news/bernie-sanders-arrives-in-hong-kong-to-lecture-protesters-on-how-good-they-have-it-under-communism",
    "https://babylonbee.com/news/jussie-smollett-offered-job-at-cnn-after-fabricating-news-story-out-of-thin-air",
    "https://babylonbee.com/news/portland-police-wish-there-were-some-kind-of-organized-armed-force-that-could-fight-back-against-antifa",
    "https://babylonbee.com/news/to-celebrate-move-to-texas-tesla-introduces-battery-powered-ar-15",
    "https://babylonbee.com/news/genius-trump-nominates-joe-biden-to-supreme-court",
    "https://babylonbee.com/news/hillary-clinton-accidentally-posts-condolences-for-tulsi-gabbards-suicide-one-day-early",
    "https://babylonbee.com/news/twitter-shuts-down-entire-network-to-slow-spread-of-negative-biden-news",
    "https://babylonbee.com/news/celebrities-show-solidarity-with-protesters-by-burning-their-own-homes-to-the-ground",
    "https://babylonbee.com/news/lego-introduces-new-sharper-bricks-that-instantly-kill-you-when-you-step-on-them",
    "https://babylonbee.com/news/democrats-call-for-flags-to-be-flown-half-mast-to-grieve-death-of-soleimani",
    "https://babylonbee.com/news/californians-brace-for-deadly-50-degree-cold-front",
    "https://babylonbee.com/news/brilliant-trump-puts-himself-on-all-postage-stamps-forcing-democrats-to-abolish-the-usps",
    "https://babylonbee.com/news/nations-nerds-wake-up-in-utopia-where-everyone-stays-inside-sports-canceled-social-interaction-forbidden",
    "https://babylonbee.com/news/hollywood-rushes-to-make-pedophilia-acceptable-before-theyre-outed-by-ghislaine-maxwell",
    "https://babylonbee.com/news/as-part-of-settlement-with-nick-sandmann-cnn-hosts-must-wear-maga-hats-while-on-the-air",
    "https://babylonbee.com/news/biden-campaign-says-he-is-so-close-to-a-vp-pick-he-can-smell-her",
    "https://babylonbee.com/news/trump-says-to-drink-lots-of-water-media-reports-as-deranged-trump-tells-everyone-to-drown-themselves",
    "https://babylonbee.com/news/starbucks-unveils-new-satanic-holiday-cups",
    "https://babylonbee.com/news/bill-clinton-allegations-of-sexual-misconduct-should-disqualify-a-man-from-public-office",
    "https://babylonbee.com/news/joel-osteen-launches-line-pastoral-wear-sheeps-clothing"
]

def scrape_bee_article(url):
    """
    Scrapes an article from a given URL on babylonbee.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Satire", #satire is hardcoded here as we know BabylonBee is a satire site
        "url": url
    }

    # A timeout prevents the scraper from hanging indefinitely on an unresponsive server
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract the title from the meta property "og:title"
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        
        # Extract the URL from the meta property "og:url"
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Extract the site name from the meta property "og:site_name"
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Extract the published date from the meta tag named "published_at"
        published_date_meta = soup.find('meta', {"name": "published_at"})
        if published_date_meta and published_date_meta.get("content"):
            # e.g., "2019-12-23 11:31:05" -> keep only the date part
            article_data["date"] = published_date_meta["content"].split()[0]
        else:
            article_data["date"] = "Published date not found"
        
        # Extract the category (e.g., "Politics")
        category_link = soup.find("a", href=lambda href: href and "/news/categories/" in href)
        if category_link:
            article_data["category"] = category_link.get_text(strip=True)
        else:
            article_data["category"] = "Category not found"
            
        # Extract the article text
        content_div = soup.find("div", class_="text-lg mt-6 leading-6 text-gray-700 article-content mx-2 sm:mx-0")
        if content_div:
            paragraphs = content_div.find_all("p")
            full_text = " ".join(p.get_text(strip=False) for p in paragraphs)
            article_data["text"] = full_text.strip()
        else:
            article_data["text"] = "Article text not found"
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data


def scrape_multiple_bee_articles(urls):
    """
    Scrapes multiple articles from a list of URLs and stores the data in a DataFrame.

    Parameters:
    ----------
    urls : list
        A list of article URLs to scrape.

    Returns:
    -------
    pd.DataFrame
        A DataFrame containing the scraped data from all URLs.
    """
    articles = []
    for url in urls:
        article = scrape_bee_article(url)
        articles.append(article)
    return pd.DataFrame(articles)

# Scrape articles and create a DataFrame
bee_data_df = scrape_multiple_bee_articles(urls)
# Store to CSV
bee_data_df.to_csv("satire_scraped_articles_bee.csv", index=False)
# Display the DataFrame
bee_data_df

Unnamed: 0,title,text,site,date,category,class,url
0,Trump: 'I Have Done More For Christianity Than Jesus',"WASHINGTON, D.C. - In response to the Christianity Today editorial calling for his removal, Trump called the magazine a ""left-wing rag"" and said, ""I have done more for Christianity than Jesus."" ""I mean, the name of the magazine is Christianity Today, and who is doing more for Christians today? Not Jesus. He disappeared; no one knows what happened to him. But I'm out there every day protecting churches from crazy liberals."" While Trump admitted that Jesus did do some things for Christianity in the past, Trump said he was doing more now and it was more substantial. ""I'm appointing judges to help protect religious rights,"" Trump stated. ""How many judges has Jesus appointed? He says something about judging people in the future, but I ain't seen it."" Furthermore, Trump asserted that he ""saved Christmas."" ""Look what I've done,"" he said. ""You can say 'Merry Christmas' now. In fact, if you say 'Happy Holidays' and don't immediately make it clear you're referring to Christmas, you go to prison. What has Jesus ever done for Christmas? Be born? He wants credit for that? Come on.""",The Babylon Bee,2019-12-23,Politics,Satire,https://babylonbee.com/news/trump-i-have-done-more-for-christianity-than-jesus
1,Senate To Be Replaced With Room Full Of Monkeys Throwing Feces,"WASHINGTON, D.C. - In an emergency, overnight referendum, the American people voted on Thursday to replace the United States Senate with a room full of monkeys throwing feces. The measure passed with 57% of the vote. 22% of voters thought the Senate should be replaced by barking seals, while 17% voted that the replacement should be the pit of venomous snakes from Indiana Jones. 3.97% voted that Senate members be replaced by screaming goats. ""About 100 people"" voted for the current Senators to keep their jobs, with this tiny voting bloc centered in Washington, D.C. Highland Ape Rescue out of West Virginia will be teaming up with Cornwell Primate farms to supply hundreds of monkeys and apes to the Senate. The animals will be fed a nutritious mixture of foods that produce easily throwable feces. Protective glass will be put up around the Senate for camera crews to safely film, but anyone being interviewed by the new senators will have to sit in the middle of the poo-flinging octagon, coming under a heavy barrage of projectile excrement. ""It will be a huge improvement from how things were before,"" said ape trainer, Marlena Henwick. ""No more 10-12 hour hearings. With these monkeys, all the fecal projectiles will have been flung in under 30 minutes. One and done."" The recently replaced senators will be placed on display at the National Zoo in Washington, D.C. for families to observe and zoologists to study.",The Babylon Bee,2018-09-28,Politics,Satire,https://babylonbee.com/news/senate-to-be-replaced-with-room-full-of-monkeys-throwing-feces
2,Motorcyclist Who Identifies As Bicyclist Sets Cycling World Record,"NEW YORK, NY - In an inspiring story from the world of professional cycling, a motorcyclist who identifies as a bicyclist has crushed all the regular bicyclists, setting an unbelievable world record. In a local qualifying race for the World Road Cycling League, the motorcyclist crushed the previous 100-mile record of 3 hours, 13 minutes with his amazing new score of well under an hour. Professional motorcycle racer Judd E. Banner, the brave trans-vehicle rider, was allowed to race after he told league organizers he's always felt like a bicyclist in a motorcyclist's body. ""Look, my ride has handlebars, two wheels, and a seat,"" he told reporters as he accepted a trophy for his incredible time trial. ""Just because I've got a little extra hardware, such as an 1170-cc flat-twin engine with 110 horsepower, doesn't mean I have any kind of inherent advantage here."" Banner also said he painted the word ""HUFFY"" on the side of his bike, ensuring he has no advantage over the bikes that came out of the factory as bicycles. Some critics say he needs to cut off his motor in order to make the competition fairer, but he quickly called these people bigots, and they were immediately banned from professional cycle racing.",The Babylon Bee,2019-10-25,Sports,Satire,https://babylonbee.com/news/motorcycle-that-identifies-as-bicycle-sets-world-cycling-record
3,Trump Announces He Has Hidden 5 Golden Tickets Among Stimulus Checks,"WASHINGTON, D.C. - Trump has built up a lot of buzz over the coming stimulus payments, saying he has hidden five golden tickets among the checks heading to Americans this week. Anyone who gets a golden ticket will win a free tour of Mar-a-Lago. Rumor has it that Trump will be watching them closely to see which of the winners has the qualities he looks for in a manager, with the best candidate getting hired as Mar-a-Lago's onsite McDonald's manager. ""Who will win? Nobody knows!"" Trump said gleefully as he carefully signed each of the golden tickets before hiding them among the stimulus checks. ""I, Donald Trump, have decided to allow five Americans - just five, mind you, and no more - to visit my resort this year. These lucky five will be shown around personally by me, and they will be allowed to see all the secrets and the magic of my hotel and golf resort -- the best golf, maybe ever. Then, at the end of the tour, as a special present, all of them will be given Season 1 of The Apprentice on DVD!"" ""So watch out for the Golden Tickets! Five Golden Tickets have been printed on golden paper, and these five Golden Tickets have been hidden in your stimulus checks. These five may be anywhere - in any mailbox in the country. And the five lucky finders of these five Golden Tickets are the only ones who will be allowed to visit my Mar-a-Lago during the lockdown. Good luck to you all!"" Unfortunately, he put all five golden tickets in a stimulus envelope addressed to Jim Acosta.",The Babylon Bee,2020-04-15,Politics,Satire,https://babylonbee.com/news/trumps-says-5-golden-tickets-to-be-hidden-among-stimulus-checks
4,NBA Players Wear Special Lace Collars To Honor Ruth Bader Ginsburg,"LOS ANGELES, CA - NBA players are honoring the life of Ruth Bader Ginsburg this week by wearing pretty lace collars just like Notorious RBG used to wear. In a touching show of respect for the late Justice Ginsburg, and in solidarity with her progressive cause, Lebron James and the LA Lakers took to the court yesterday wearing a stunning variety of delicate white collars inspired by RBG's wardrobe. According to several commentators on ESPN, the virtual teleconference crowd fell silent in reverent awe as the players all knelt down and chanted ""RBG! RBG! RBG!"" ""Yeah, RBG was an amazing person,"" said LeBron James after the game. ""I have her biography right here and I totally read it right before the game. She was a judge. That's cool, I respect that. Judges judge things and not everyone can do that. She believed in Black Lives Matter and being on the right side of history and stuff."" Power forward Anthony Davis also expressed his happiness with the collars. ""It's good to honor her today with these lacey things. Commissioner Adam Silver and President Xi Jinping told us to wear them so we did. I just took this little doily thing from under a table lamp at my mom's house and cut a hole in the middle. Easy."" NBA players are vowing to wear the collars until Trump is removed from office, or until angry rioters burn their basketball arenas down, whichever comes first.",The Babylon Bee,2020-09-22,Politics,Satire,https://babylonbee.com/news/nfl-to-adorn-all-uniforms-with-lace-doilies-in-to-honor-rbg
5,"In Bold Anti-Trump Statement, Pelosi Rips Up Bible","WASHINGTON, D.C. - In a bold, powerful statement to oppose Trump, Speaker of the House Nancy Pelosi solemnly tore up the Bible after Trump was seen holding one up in front of a church. At a press conference, the Speaker of the House held up a Bible and then ripped it in two, declaring that she was against anything Trump was associated with. ""If Trump is for the Bible, then I am against it,"" she said as she struggled to rip the Bible in half. Finally, aides came to intervene, pre-ripping the spine of the Bible so it would be easier for her to tear. ""All the books of the Bible are bad: Genesis, Joseph, the one with the big fish, even Hezekiah. We must stand against Trump's bigotry by ripping up anything he claims to be for."" ""Yass, queen! Slay!"" shouted her fans at the press conference as she finally managed to rip the Bible up. ""You're my president!"" In a genius move, Trump then held up a Koran in front of a mosque, forcing Pelosi to tear up a Koran and alienate the left.",The Babylon Bee,2020-06-03,Politics,Satire,https://babylonbee.com/news/pelosi-rips-up-bible
6,Biden Cuts Hole In Mask So He Can Still Sniff People's Hair,"WASHINGTON, D.C. - Joe Biden has committed to wearing a mask in public to be a good example and to prevent the spread of COVID-19. Aides were disappointed and a little frightened, however, when Biden immediately cut a large hole in the middle of the mask so he could continue to invade people's personal space and sniff their hair, necks, and faces. Staffers usually don't let Biden play with sharp objects, but he managed to find some safety scissors stashed behind the Metamucil in his campaign bus. Using the purple plastic scissors, he cut a large hole and then fitted the mask to his face, confident that he was protecting himself and others from the virus. ""That's better,"" he said as he cut a big hole for his schnoz. ""Now I'm protecting against infection and I'm still able to give the ladies a good sniff. You know, in my day, I wore a mask just like this, as was the fashion at the time. All the kids at the pool would ask to play with the mask, and they'd run their fingers through it. In fact, one time, a gangster named CornPop was about to go cause some trouble at the sock hop, and I put some rocks in my mask and started swinging it around like a sling. You know, real Daniel and Goliath type stuff. He looked at me, tears in his eyes, and promised never again to go out and cause a ruckus."" ""Anyway, that's why I'm your best choice for senator of the Roman Empire. Vote for Joe!"" Biden suddenly came to and realized he was standing in a Walmart parking lot talking to a hobo.",The Babylon Bee,2020-04-09,Politics,Satire,https://babylonbee.com/news/biden-cuts-holes-in-medical-mask-so-he-can-still-sniff-people
7,Man Identifying As 6-Year-Old Crushes Game-Winning Homer In Tee-Ball Championship,"AUBURN, CA - Local 36-year-old man Nate Ripley, who identifies as a six-year-old, ""absolutely crushed"" a game-winning homer at a local tee-ball game and won the championship for his team Monday evening, reports confirmed. Ripley reportedly walked up to the plate in the bottom of the 6th, pointed his bat toward the left-field wall looming 130 feet in the distance, and let her rip, sending the ball rocketing over the fence and into a parking lot as the fans cheered and his coach yelled out, ""Attaboy, Nate! Good job, bud!"" His team, the Lil' Padres, attempted to hoist him up on their shoulders in celebration of their great victory over the favored Tiny Tigers, but were unable to pick up the large 230-pound man. Ripley's feat comes at the end of a momentous tee-ball season, in which the self-identified six-year-old absolutely shattered every record set prior to that point. With a 1.000 batting average, 52 home runs, and an incredible showing at first base, second base, shortstop, third base, and pitcher, the man is being called an inspiration to other six-year-olds everywhere. ""I'm just proud to be here with my team. It's all for the love of the game,"" an emotional Ripley told reporters while enjoying an orange slice and juice box after the championship. ""I couldn't have done it without my team.""",The Babylon Bee,2017-06-06,Lifestyle,Satire,https://babylonbee.com/news/man-identifying-6-year-old-crushes-game-winning-homer-tee-ball-championship
8,Biden: 'I Am The Only Candidate Who Can Beat Ronald Reagan',"HOUSTON, TX - Fresh off his afternoon nap, presidential candidate Joe Biden gave a fiery, high-energy speech in Houston today, claiming to be the only candidate who could beat incumbent Ronald Reagan. ""I am the only candidate who can unite the party to defeat Reagan,"" he said to scattered applause. ""When Super Thursday hits here in a few weeks, we can rally the 150 million Democrats here in the great country of Texas to vote for me so we can get Reagan and his crony Dick Cheney off the Iron Throne there in the Imperial Senate. Go Hoosiers!"" Aides scrambled to turn off Biden's mic but he beat them away with his walker. ""The time has come for the reign of Tippecanoe and Tyler too to end!"" he shouted, though by this point he had wandered into a nearby field and no one could hear him.",The Babylon Bee,2020-03-02,Politics,Satire,https://babylonbee.com/news/biden-i-am-the-only-candidate-who-can-beat-ronald-reagan
9,Fisher-Price Releases 'My First Peaceful Protest' Playset With House You Can Actually Burn Down,"EAST AURORA, NY - The toy geniuses at Fisher-Price have announced a brand new toy made just for leftist parents and their kids: the My First Peaceful Protest playset. The kid-size clubhouse will come with several varieties of spray paint so kids can tag the tiny building with their own empowering slogans. It will also be made out of cardboard, allowing the cute little tikes to burn the whole thing down if their demands are not met. ""Here at Fisher-Price, we are steadfastly committed to social justice,"" said toy designer Camden Flufferton. ""We need to teach our kids what democracy looks like, and there's no better example of democracy in action than violent vandalism and arson. We hope this new playset will serve as an inspiration for parents wanting to teach their kids how to threaten citizens with violence whenever their demands are not met."" The set will also come with toy televisions, cell phones, jewelry, and clothing, allowing kids to simulate looting before they torch the entire set. The set will be available in stores for $399 because of capitalism. Experts are questioning the wisdom of this move by Fisher-Price, mainly because people in the target market don't typically have any kids. ""We know we'll probably only sell, like, 3 of these,"" said Flufferton, ""but selling them isn't the point. We just need you to know we're on the right side of history.""",The Babylon Bee,2020-09-21,Politics,Satire,https://babylonbee.com/news/fisher-price-introduces-supreme-court-protest-playhouse-that-can-be-vandalized-and-burned-down
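
The scraper above leaves a field empty or fills it with a placeholder string such as "Title not found" when extraction fails, so a completeness check before rows enter the custom dataset may be worthwhile. The following is a minimal sketch and not part of the original notebook: `PLACEHOLDERS`, `find_incomplete_fields`, and `split_complete` are hypothetical names, and the sample records are invented for illustration.

```python
# Placeholder strings that appear in scrape_bee_article when a field is missing.
# This set is an assumption based on the scraper code above.
PLACEHOLDERS = {
    "Title not found",
    "Site name not found",
    "Published date not found",
    "Category not found",
    "Article text not found",
}


def find_incomplete_fields(article):
    """Return the names of fields that are empty or hold a placeholder string."""
    return [
        field
        for field, value in article.items()
        if value == "" or value in PLACEHOLDERS
    ]


def split_complete(articles):
    """Split a list of article dicts into (complete, incomplete) lists."""
    complete, incomplete = [], []
    for article in articles:
        if find_incomplete_fields(article):
            incomplete.append(article)
        else:
            complete.append(article)
    return complete, incomplete


# Invented sample records that mimic the scraper's output structure
sample = [
    {"title": "A", "text": "Body", "site": "The Babylon Bee",
     "date": "2020-01-01", "category": "Politics", "class": "Satire",
     "url": "https://example.com/a"},
    {"title": "Title not found", "text": "Article text not found", "site": "",
     "date": "", "category": "Category not found", "class": "Satire",
     "url": "https://example.com/b"},
]

complete, incomplete = split_complete(sample)
print(len(complete), len(incomplete))
print(find_incomplete_fields(sample[1]))
```

A check like this could be run on `bee_data_df.to_dict("records")` to decide which URLs need to be re-scraped or dropped before the rows count towards the 200 items per category.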
