# Data Scraping 
## Import Libraries
We start by importing the libraries needed for scraping the news data, then print their versions to check that they match the expected ones.

In [15]:
import sklearn
import pandas
import seaborn
import requests
from bs4 import BeautifulSoup

print("scikit-learn version:", sklearn.__version__)     # 1.6.1
print("pandas version:", pandas.__version__)            # 2.2.3
print("seaborn version:", seaborn.__version__)          # 0.13.2
print("requests version:", requests.__version__)        # 2.31.0

scikit-learn version: 1.6.1
pandas version: 2.2.3
seaborn version: 0.13.2
requests version: 2.31.0
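The cell above only prints the versions, so the comparison with the expected values is done by eye. A minimal sketch of an automated check is shown below; it assumes the expected versions are the ones noted in the comments, and the names `EXPECTED_VERSIONS` and `find_version_mismatches` are hypothetical helpers, not part of any library.

```python
from importlib import metadata

# Expected versions, copied from the comments in the import cell (assumption:
# these are the versions the project is meant to run on).
EXPECTED_VERSIONS = {
    "scikit-learn": "1.6.1",
    "pandas": "2.2.3",
    "seaborn": "0.13.2",
    "requests": "2.31.0",
}


def find_version_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, installed)} for every package that differs.

    A package that is not installed is reported with installed=None.
    """
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


# Report every package whose installed version differs from the expected one.
for package, (wanted, installed) in find_version_mismatches(EXPECTED_VERSIONS).items():
    print(f"{package}: expected {wanted}, found {installed}")
```

If nothing is printed, every installed version matches the expected one.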


## Article Details
I plan to examine each article's:
- title
- link
- source
- author (if available)
- publication date
- text/content
- classification: "Fake" (0) or "Real" (1)
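The fields listed above could be captured in a small record type so that every scraped article has the same shape. This is only a sketch: the class name `ArticleRecord` is hypothetical, and the field names follow the dictionary keys used in the scraping code (`date`, `text`, `classification`).

```python
from dataclasses import dataclass, asdict


@dataclass
class ArticleRecord:
    """One scraped article (hypothetical helper, not used by the notebook)."""
    title: str
    link: str
    source: str
    text: str
    classification: int   # 0 = "Fake", 1 = "Real"
    author: str = ""      # empty string when no author is available
    date: str = ""        # publication date as found on the page

    def __post_init__(self):
        # Reject labels other than the two classes defined above.
        if self.classification not in (0, 1):
            raise ValueError("classification must be 0 (Fake) or 1 (Real)")


# asdict() gives a plain dict that can go straight into a pandas DataFrame.
example = ArticleRecord(title="Example title", link="https://example.com/a",
                        source="BBC", text="Example text", classification=1)
print(asdict(example))
```

Validating the label at construction time means a mistyped class value fails early instead of ending up in the CSV.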

## Truthful websites
### BBC
I believe starting with the BBC is the easiest choice: in my research I found that it was ranked as the most trusted news source by Americans. Even though we live in Europe, I choose to rely on this statistic.

In [3]:
bbc_url = 'https://www.bbc.com'

def scrape_article_details(link):
    # timeout so that a stalled connection cannot hang the whole scrape
    response = requests.get(link, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # article text
    article_text = ''
    article_body = soup.find('article')
    if article_body:
        paragraphs = article_body.find_all('p')  # find all <p> tags
        for p in paragraphs:
            article_text += p.get_text(strip=True) + '\n\n'

    # author
    author_tag = soup.find('span', class_='sc-b42e7a8f-7 kItaYD')
    author = author_tag.get_text(strip=True) if author_tag else ''

    # publication date
    time_tag = soup.find('time', {'datetime': True})
    publication_date = time_tag['datetime'] if time_tag else ''

    return article_text, author, publication_date

def scrape_news():
    url = f'{bbc_url}/news'

    # timeout so that a stalled connection cannot hang the scrape
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    articles = []
    for article in soup.find_all('a', class_='sc-2e6baa30-0 gILusN'):
        title_element = article.find('h2', class_='sc-87075214-3 eywmDE')
        if title_element:
            title = title_element.get_text(strip=True)
            href = article['href']

            # check if href is a full URL or relative path
            if href.startswith('http'):
                link = href
            else:
                link = f"{bbc_url}{href}"

            # additional details
            article_text, author, publication_date = scrape_article_details(link)

            articles.append({
                'title': title,
                'link': link,
                'source': 'BBC',
                'author': author,
                'date': publication_date,
                'text': article_text,
                'classification': 1
            })

    return articles

news_articles = scrape_news()

# save to CSV
df = pandas.DataFrame(news_articles)
df.to_csv("bbc_news_articles_v1.csv", index=False)

print("Saved to bbc_news_articles_v1.csv")

Saved to bbc_news_articles_v1.csv
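The scraper decides between full and relative links with `href.startswith('http')`. A possible alternative, sketched below, is `urllib.parse.urljoin` from the standard library, which also copes with relative paths that lack a leading slash. The helper name `resolve_link` and the example paths are assumptions made for illustration only.

```python
from urllib.parse import urljoin

BBC_URL = "https://www.bbc.com"


def resolve_link(href, base=BBC_URL):
    """Turn an href taken from the listing page into an absolute URL."""
    # The trailing slash makes the base behave like a directory, so a path
    # without a leading slash is appended rather than glued onto the host name.
    return urljoin(base + "/", href)


# Hypothetical hrefs covering the cases the scraper may meet.
for href in ["/news/articles/abc123", "https://www.bbc.co.uk/sport", "news/world"]:
    print(resolve_link(href))
```

With plain string concatenation, an href such as `news/world` would become `https://www.bbc.comnews/world`; `urljoin` avoids that case and leaves absolute URLs unchanged.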


## Fake websites

For now, I decided to generate fake data for the fake news class. I looked extensively into existing fake news datasets, but most of them are 6-7 years old and might no longer be relevant. At this stage of the project (iteration 0), I decided to go with this generated data.

In [22]:
fake_dataset = pandas.read_csv('fake_news_articles_v1.csv')
fake_dataset.sample(5)

| Unnamed: 0 | title | link | source | author | date | text | classification |
|---|---|---|---|---|---|---|---|
| 25 | Sort marriage amount. | https://www.graves-lopez.com/personal-movement | Scott Inc News | James Craig | 3/18/2025 | Who force series movement tax will specific. B... | 0 |
| 24 | Shoulder agent say per enough. | http://www.ritter-daniel.com/his-bring-only | Larson Ltd News | Victor Snow | 2/11/2022 | Behind enjoy they trip theory piece season. Ah... | 0 |
| 1 | Perform same tough risk authority store. | https://www.fernandez.net/thus-item-executive | Hanson LLC News | Alan Lewis | 4/2/2024 | Card whom history position learn leave rate. R... | 0 |
| 37 | Specific with agency. | https://scott-gates.info/indeed-vote-tend | Keller-Brewer News | Sarah Turner | 2/6/2024 | Manager bar reduce. Sing individual may floor ... | 0 |
| 27 | A TV within reduce sort concern different meas... | https://www.fleming.com/religious-state-my | Maddox-Anderson News | Tamara Rodriguez | 9/2/2024 | It open billion democratic. Partner activity s... | 0 |
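For reference, placeholder rows with the same columns can be produced with the standard library alone. The sketch below is only an illustration under stated assumptions and is not necessarily how `fake_news_articles_v1.csv` was created: the word list, the `example.com` links, the "Example News" source, and the function name `make_placeholder_article` are all hypothetical.

```python
import random
from datetime import date, timedelta

# Hypothetical vocabulary; any word list would do.
WORDS = ["agency", "market", "report", "policy", "season", "theory", "budget", "signal"]


def make_placeholder_article(rng):
    """Build one placeholder row with the same columns as the scraped articles."""
    def sentence(n_words):
        words = " ".join(rng.choice(WORDS) for _ in range(n_words))
        return words.capitalize() + "."

    # Random day within three years of 2022-01-01, written as M/D/YYYY.
    day = date(2022, 1, 1) + timedelta(days=rng.randrange(3 * 365))
    slug = "-".join(rng.choice(WORDS) for _ in range(3))
    return {
        "title": sentence(rng.randint(3, 6)),
        "link": f"https://example.com/{slug}",
        "source": "Example News",
        "author": "",
        "date": f"{day.month}/{day.day}/{day.year}",
        "text": " ".join(sentence(rng.randint(5, 9)) for _ in range(4)),
        "classification": 0,  # 0 = "Fake"
    }


rng = random.Random(42)  # fixed seed so the rows are reproducible
placeholder_rows = [make_placeholder_article(rng) for _ in range(5)]
print(placeholder_rows[0]["title"])
```

Passing `placeholder_rows` to `pandas.DataFrame` would give a table with the same columns as the BBC data, which keeps the two sources easy to combine later.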
