# Data Scraping ⛏️

Since this is my first attempt at scraping data, I will start with trustworthy news outlets only. The BBC seems like the easiest place to begin: according to my research, it ranks as the 4th most trusted news source among Americans. Even though we live in Europe, I choose to trust this statistic. Later on I plan to expand my list of trustworthy sources, but for now this is my starting point, so bear with me.

First, let's import the libraries we are going to use today and check their versions.

In [None]:
import sklearn
import pandas as pd
import seaborn
import requests
from bs4 import BeautifulSoup

print("scikit-learn version:", sklearn.__version__)     # 1.6.1
print("pandas version:", pd.__version__)                # 2.2.3
print("seaborn version:", seaborn.__version__)          # 0.13.2
print("requests version:", requests.__version__)        # 2.31.0

scikit-learn version: 1.6.1
pandas version: 2.2.3
seaborn version: 0.13.2
requests version: 2.31.0
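
The cell above does not print a version for BeautifulSoup. As a minimal sketch, the installed versions can also be read from package metadata with the standard library's `importlib.metadata`, without importing each library. The helper name `installed_version` is my own, and the distribution names in the list (for example `beautifulsoup4` for `bs4`) are the names I assume the packages were installed under.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(distribution_name):
    """Return the installed version string, or None if the package is not installed."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None

# Distribution names as assumed to be published on PyPI (not the import names)
for name in ["scikit-learn", "pandas", "seaborn", "requests", "beautifulsoup4"]:
    print(f"{name} version:", installed_version(name))
```

A `None` in the output would mean the package is missing from the current environment.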


Now, let's get to the actual scraping!

In [None]:
# Import only the libraries this cell needs
# (versions were already printed in the previous cell)
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Function to scrape titles and links
def scrape_news_titles_and_links(url, title_selector, link_selector):
    try:
        # A timeout keeps the request from hanging forever on a slow server
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise an exception for HTTP errors
    except requests.RequestException as e:
        # Only network and HTTP errors are caught here
        print(f"An error occurred: {e}")
        return []

    soup = BeautifulSoup(response.text, 'html.parser')  # Parse the downloaded HTML
    titles = [title.get_text(strip=True) for title in soup.select(title_selector)]
    links = [link['href'] for link in soup.select(link_selector) if link.has_attr('href')]

    # Combine titles and links into a list of dictionaries.
    # zip pairs items by position, so this assumes both selectors match
    # the same articles in the same order.
    articles = [{"Title": t, "Link": l} for t, l in zip(titles, links)]
    return articles

# Example website (BBC News)
url = "https://www.bbc.com/news"

# Define CSS selectors for titles and links
title_selector = ".gs-c-promo-heading__title"  # Class expected on the headline text element
link_selector = ".gs-c-promo-heading"         # Class expected on the link wrapping the headline

# Scrape titles and links
articles = scrape_news_titles_and_links(url, title_selector, link_selector)

# Convert the articles to a DataFrame and save to a CSV file
if articles:
    data = pd.DataFrame(articles)
    data.to_csv("truthful_news_titles_and_links.csv", index=False)
    print("Article metadata saved to 'truthful_news_titles_and_links.csv'")
    print(data.head())
else:
    print("No articles were scraped. Check your CSS selectors.")

No articles were scraped. Check your CSS selectors.
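
An empty result like this most likely means that the two selectors no longer match anything in the page that was downloaded, since site markup changes over time. Below is a minimal sketch of how this could be checked offline: count how many elements each selector matches, then pair every title with the link of its own enclosing `<a>` tag instead of zipping two separate lists. The HTML snippet, the `.headline` class, the `https://example.com` base URL, and the helper names `count_matches` and `extract_articles` are all made up for illustration and are not BBC's real markup.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page (not BBC's real markup)
SAMPLE_HTML = """
<div>
  <a href="/news/article-1"><h2 class="headline">First story</h2></a>
  <a href="https://example.com/news/article-2"><h2 class="headline">Second story</h2></a>
  <h2 class="headline">Headline without a link</h2>
</div>
"""

def count_matches(html, selector):
    """Return how many elements a CSS selector matches in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(selector))

def extract_articles(html, title_selector, base_url):
    """Pair each title with the link of its enclosing <a> tag.

    Titles that are not wrapped in a link are skipped, and relative
    links are resolved against base_url.
    """
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for title in soup.select(title_selector):
        anchor = title.find_parent("a", href=True)
        if anchor is None:
            continue
        articles.append({
            "Title": title.get_text(strip=True),
            "Link": urljoin(base_url, anchor["href"]),
        })
    return articles

# The selector from the cell above matches nothing in this sample,
# while the sample's own (made-up) class does.
old_count = count_matches(SAMPLE_HTML, ".gs-c-promo-heading__title")
new_count = count_matches(SAMPLE_HTML, ".headline")
articles = extract_articles(SAMPLE_HTML, ".headline", "https://example.com")

print("Matches for old selector:", old_count)
print("Matches for sample selector:", new_count)
for article in articles:
    print(article)
```

Running the same count against `response.text` from the real page would show whether the selectors are the problem. If both counts are zero, the next step would be to inspect the page source in the browser and pick selectors that exist there.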
