# Data Scraping ⛏️

Let's begin by importing the libraries and checking their versions.

In [29]:
import sklearn
import pandas
import seaborn
import requests
from bs4 import BeautifulSoup

print("scikit-learn version:", sklearn.__version__)     # 1.6.1
print("pandas version:", pandas.__version__)            # 2.2.3
print("seaborn version:", seaborn.__version__)          # 0.13.2
print("requests version:", requests.__version__)        # 2.31.0

scikit-learn version: 1.6.1
pandas version: 2.2.3
seaborn version: 0.13.2
requests version: 2.31.0


## Article Details 🔎
For each article, I will collect:
- Title
- Link
- Author (if available)
- Publication date
- Content/text
- Classification: "Fake" (0) or "Real" (1)
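
To make the fields above concrete, here is a minimal sketch of one article record. The `ArticleRecord` class and the `FAKE`/`REAL` constants are hypothetical names introduced only for illustration; the notebook itself stores plain dictionaries.

```python
from dataclasses import dataclass, asdict

# Label encoding taken from the list above: "Fake" = 0, "Real" = 1.
FAKE, REAL = 0, 1


@dataclass
class ArticleRecord:
    """One scraped article (hypothetical helper, not part of the notebook)."""
    title: str
    link: str
    text: str
    classification: int
    author: str = ""   # optional: not every article exposes an author
    date: str = ""     # publication date as a string, empty if unknown

    def __post_init__(self):
        # Reject labels other than the two defined above.
        if self.classification not in (FAKE, REAL):
            raise ValueError("classification must be 0 (Fake) or 1 (Real)")


example = ArticleRecord(
    title="Example headline",
    link="https://example.com/article",
    text="Body text.",
    classification=REAL,
)
row = asdict(example)  # a plain dict, ready to be collected into a DataFrame
```

Keeping the author and date optional mirrors the "(if available)" note: a missing value becomes an empty string rather than an error.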

### BBC 🌐

Starting with the BBC seems the most straightforward option: according to my research, Americans rank it as the most trusted news source. Although we live in Europe and the ranking reflects a US audience, I choose to trust this statistic.

In [None]:
bbc_url = 'https://www.bbc.com'

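def scrape_article_details(link):
    """Return (article_text, author, publication_date) for one article page."""
    # a timeout keeps one slow page from hanging the whole scrape
    response = requests.get(link, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')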

    # article text
    article_text = ''
    article_body = soup.find('article')
    if article_body:
        paragraphs = article_body.find_all('p')  # find all <p> tags
        for p in paragraphs:
            article_text += p.get_text(strip=True) + '\n\n'

    # author (this class name looks auto-generated, so it may change when the site is updated)
    author_tag = soup.find('span', class_='sc-b42e7a8f-7 kItaYD')
    author = author_tag.get_text(strip=True) if author_tag else ''

    # publication date
    time_tag = soup.find('time', {'datetime': True})
    publication_date = time_tag['datetime'] if time_tag else ''

    return article_text, author, publication_date

def scrape_news():
    url = f'{bbc_url}/news'

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    articles = []
    for article in soup.find_all('a', class_='sc-2e6baa30-0 gILusN'):
        title_element = article.find('h2', class_='sc-87075214-3 eywmDE')
        if title_element:
            title = title_element.get_text(strip=True)
            href = article['href']

            # check if href is a full URL or relative path
            if href.startswith('http'):
                link = href
            else:
                link = f"{bbc_url}{href}"

            # additional details; skip articles whose page cannot be fetched
            try:
                article_text, author, publication_date = scrape_article_details(link)
            except requests.RequestException:
                continue

            articles.append({
                'title': title,
                'link': link,
                'author': author,
                'date': publication_date,
                'text': article_text,
                'classification': 1
            })

    return articles

news_articles = scrape_news()

# save to CSV
df = pandas.DataFrame(news_articles)
df.to_csv("bbc_news_articles_v2.csv", index=False)

print("Saved to bbc_news_articles_v2.csv")

Saved to bbc_news_articles_v2.csv
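
The `href` check in `scrape_news` only distinguishes links that start with `http` from everything else. As a possible alternative, the sketch below uses the standard library's `urljoin`, which also handles protocol-relative links such as `//www.bbc.com/...`. The helper names `absolute_link` and `unique_links` are hypothetical and not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.bbc.com"  # same value as bbc_url in the cell above


def absolute_link(href, base=BASE_URL):
    """Resolve a relative, protocol-relative, or absolute href against base."""
    return urljoin(base, href)


def unique_links(hrefs, base=BASE_URL):
    """Resolve every href and drop duplicates while keeping first-seen order."""
    resolved = (absolute_link(h, base) for h in hrefs)
    return list(dict.fromkeys(resolved))


links = unique_links([
    "/news/articles/abc",                      # relative path
    "https://www.bbc.com/news/articles/abc",   # same article, absolute URL
    "//www.bbc.com/news/articles/xyz",         # protocol-relative link
])
```

Deduplicating before fetching would also avoid requesting the same article twice if the front page links to it from more than one place.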


### Other dataset

For now, I have decided to use another dataset that I found on Hugging Face (https://huggingface.co/datasets/ErfanMoosaviMonazzah/fake-news-detection-dataset-English). Although the data is 6-7 years old and may not be fully relevant today, it contains valuable information that fits my current assignment. Since this is the initial stage of the project (Iteration 0), I will start with this dataset and later enhance it by scraping more relevant sources. The dataset includes both fake and real news, but it does not provide links, so for the time being I will train the model on the text and titles only, with the intention of improving it later.

In [None]:
# note: despite its name, this variable holds the Hugging Face dataset, not the BBC scrape
bbc_dataset = pandas.read_csv('huggingface_dataset.csv')
bbc_dataset.shape

(29999, 6)
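
Since the model will work with titles and text only, one way to prepare the input is to merge both into a single string per article. The sketch below runs on a tiny in-memory frame; the column names `title`, `text`, and `label`, and the helper `build_model_input`, are assumptions for illustration and should be checked against the actual columns of the loaded dataset.

```python
import pandas


def build_model_input(df, title_col="title", text_col="text"):
    """Join title and body into one string per row, treating missing values as empty."""
    title = df[title_col].fillna("").str.strip()
    text = df[text_col].fillna("").str.strip()
    return (title + " " + text).str.strip()


# Toy data standing in for the real dataset (column names are assumed).
toy = pandas.DataFrame({
    "title": ["Headline A", None],
    "text": ["Body A", "Body B"],
    "label": [1, 0],
})
toy["model_input"] = build_model_input(toy)

# Class balance: how many rows carry each label.
label_counts = toy["label"].value_counts().to_dict()
```

Checking the label counts early shows whether the fake and real classes are balanced before any model is trained.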