# Data Scraping ⛏️

Since this is my first attempt at scraping data, I will start with trustworthy news outlets only. The BBC seems like the easiest place to begin: in my research I found that Americans ranked it as their most trusted news source. Even though we live in Europe, I choose to believe this statistic. Later on I plan to expand my list of trustworthy sources, but for now this is my starting point, so bear with me.

First, let's import the libraries we are going to use today and check their versions.

In [33]:
import sklearn
import pandas
import seaborn
import requests
from bs4 import BeautifulSoup

print("scikit-learn version:", sklearn.__version__)     # 1.6.1
print("pandas version:", pandas.__version__)            # 2.2.3
print("seaborn version:", seaborn.__version__)          # 0.13.2
print("requests version:", requests.__version__)        # 2.31.0

scikit-learn version: 1.6.1
pandas version: 2.2.3
seaborn version: 0.13.2
requests version: 2.31.0


Now, let's get to the actual thing! First, we scrape the article titles and links.

In [39]:
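from urllib.parse import urljoin

bbc_url = 'https://www.bbc.com'

def scrape_news():
    """Return a list of {'title', 'link'} dicts from the BBC News front page."""
    url = bbc_url + '/news'
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(response.text, 'html.parser')

    articles = []
    # These class names look auto-generated, so they may stop matching if the site changes.
    for article in soup.find_all('a', class_='sc-2e6baa30-0 gILusN'):
        title_element = article.find('h2', class_='sc-87075214-3 eywmDE')
        if title_element and article.get('href'):
            title = title_element.text.strip()
            # urljoin handles both relative and absolute hrefs
            link = urljoin(bbc_url, article['href'])
            articles.append({'title': title, 'link': link})

    return articles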

news_articles = scrape_news()

for article in news_articles:
    print(f"Title: {article['title']}")
    print(f"Link: {article['link']}")
    print()

Title: Pope Francis to be discharged from hospital on Sunday
Link: https://www.bbc.com/news/articles/crrdv84rg4do

Title: Heavyweight boxing legend George Foreman dies aged 76
Link: https://www.bbc.com/news/articles/ckg8ez8201yo

Title: Three killed and 15 injured in New Mexico mass shooting
Link: https://www.bbc.com/news/articles/cn0jn4jzj11o

Title: At least two dead as wildfires rage in South Korea
Link: https://www.bbc.com/news/articles/cdx2801qegvo

Title: Pope Francis to be discharged from hospital on Sunday
Link: https://www.bbc.com/news/articles/crrdv84rg4do

Title: Protesters in Turkey rally 'for justice' after mayor's arrest
Link: https://www.bbc.com/news/articles/c0egjvj8vdro

Title: Hillary Clinton and Kamala Harris lose security clearance after Trump order
Link: https://www.bbc.com/news/articles/c74kg3e2m08o

Title: Facebook to stop targeting ads at UK woman after legal fight
Link: https://www.bbc.com/news/articles/c1en1yjv4dpo

Title: Heavyweight boxing legend George Fore
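
Every link in the listing is the site root followed by a relative path such as `/news/articles/...`. As a small aside, here is a sketch of how `urljoin` from the standard library resolves such a path, and why it is a bit safer than plain string concatenation if a page ever contains an absolute link. The `absolute_href` value is a made-up example, not something taken from the BBC page.

```python
from urllib.parse import urljoin

bbc_url = "https://www.bbc.com"

relative_href = "/news/articles/crrdv84rg4do"  # path taken from the listing above
absolute_href = "https://www.bbc.com/sport"    # hypothetical absolute link, for illustration only

# Plain concatenation works for relative paths but doubles the domain for absolute ones.
concatenated = bbc_url + absolute_href

# urljoin resolves relative paths against the base and leaves absolute URLs untouched.
joined_relative = urljoin(bbc_url, relative_href)
joined_absolute = urljoin(bbc_url, absolute_href)

print(concatenated)
print(joined_relative)
print(joined_absolute)
```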

Now, let's put the titles and links in a CSV file.

In [40]:
df = pandas.DataFrame(news_articles)
df.to_csv("bbc_news_articles.csv", index=False)
print("Saved to bbc_news_articles.csv")

Saved to bbc_news_articles.csv
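
The listing printed earlier repeats some headlines (the Pope Francis story shows up twice), so the CSV will most likely contain duplicate rows. Below is a minimal sketch of dropping repeats by link before building the DataFrame. `dedupe_by_link` is a hypothetical helper of my own, and the sample records just reuse two entries from the listing so the snippet runs without touching the network.

```python
def dedupe_by_link(articles):
    """Return the articles with repeated links removed, keeping the first occurrence of each."""
    seen_links = set()
    unique_articles = []
    for article in articles:
        if article["link"] not in seen_links:
            seen_links.add(article["link"])
            unique_articles.append(article)
    return unique_articles

# Sample records shaped like the scraper's output; the first entry is repeated on purpose.
sample_articles = [
    {"title": "Pope Francis to be discharged from hospital on Sunday",
     "link": "https://www.bbc.com/news/articles/crrdv84rg4do"},
    {"title": "Heavyweight boxing legend George Foreman dies aged 76",
     "link": "https://www.bbc.com/news/articles/ckg8ez8201yo"},
    {"title": "Pope Francis to be discharged from hospital on Sunday",
     "link": "https://www.bbc.com/news/articles/crrdv84rg4do"},
]

unique_sample = dedupe_by_link(sample_articles)
print(len(sample_articles), "->", len(unique_sample))
```

In the notebook this would probably be applied as `dedupe_by_link(news_articles)` right before the `pandas.DataFrame(...)` call. Calling `drop_duplicates(subset="link")` on the DataFrame should give the same result.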
