# Scraping Data from The Guardian
In this tutorial, I walk through the steps needed to scrape data from a static website with the `Requests` and `BeautifulSoup` libraries, then compile the results into a pandas dataframe and a CSV file. I demonstrate this by scraping headline data from the online newspaper "The Guardian". The documentation for BeautifulSoup can be found [here](https://beautiful-soup-4.readthedocs.io/en/latest/).

In [8]:
# 1. import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [9]:
# 2. copy and paste the URL of the static website you wish to scrape from:
URL = "https://www.theguardian.com/us"

In [10]:
# 3. send a GET request to the URL:
page = requests.get(URL)
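
The GET request above assumes the server answers promptly and successfully. As a minimal sketch of a more defensive variant, the helper below adds a timeout and checks the HTTP status before returning the page content. The function name `fetch_page` and the 10-second default are assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_page(url: str, timeout: float = 10.0) -> bytes:
    """Download a page and return its raw bytes, failing loudly on HTTP errors."""
    # Without a timeout, requests.get can wait indefinitely on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.content
```

With this helper, `fetch_page(URL)` could be passed to `BeautifulSoup` in place of `page.content`.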

In [11]:
# 4. familiarize yourself with the website's HTML using your browser's developer tools, then parse the
# page with BeautifulSoup and select the elements you want to scrape:
soup = BeautifulSoup(page.content, "html.parser")
main = soup.find('main')
sections = main.find_all('section')
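
To see what `find` and `find_all` return without touching the live site, here is a small sketch that runs the same calls on an inline HTML string. The markup is a hypothetical, heavily simplified stand-in and does not reflect The Guardian's real page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup used only for illustration.
html = """
<html><body>
  <header><section id="nav">Navigation</section></header>
  <main>
    <section id="headlines"><a href="/a">First story</a></section>
    <section id="sport"><a href="/b">Second story</a></section>
  </main>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
main = soup.find("main")              # first <main> tag, or None if there is none
sections = main.find_all("section")   # every <section> nested inside <main>
section_ids = [s.get("id") for s in sections]
print(section_ids)  # ['headlines', 'sport']
```

The `<section>` inside `<header>` is not returned because the search starts from `main` rather than from `soup`.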

In [12]:
# 5. extract the desired data from the HTML using BeautifulSoup functions and add data to a pandas dataframe: 

# initialize a pandas df
df = pd.DataFrame({'Section Name':[], 'Headline': [], 'Time':[], 'Link':[]})

for section in sections:
    # extract the newspaper section name
    if 'id' in section.attrs:
        sect_name = section['id']
    else:
        continue
    
    # each headline is associated with a link
    for link in section.find_all('a'):
        # extract the headline
        if 'aria-label' in link.attrs:
            headline = link['aria-label']
        else:
            # no aria-label: join the link's non-empty text fragments with ': '
            # (joining avoids the trailing ':' left by appending ': ' after every fragment)
            headline = ': '.join(link.stripped_strings)

        # extract the time the headline was published if the information is available
        # look the <time> tag up once; .get returns None if it has no title attribute
        time_tag = link.parent.find('time')
        time = time_tag.get('title') if time_tag is not None else None

        # .get avoids a KeyError for anchors that have no href attribute
        newrow = [sect_name, headline, time, link.get('href')]
        df.loc[len(df.index)] = newrow
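
Appending to a dataframe one row at a time with `df.loc` works, but it gets slow as the table grows. A common alternative is to collect the rows in a plain list and build the dataframe once at the end. The sketch below does that on hypothetical inline markup (not The Guardian's real HTML). As a further assumption, it turns the relative links into absolute URLs with `urllib.parse.urljoin`, which the original loop does not do.

```python
from urllib.parse import urljoin

import pandas as pd
from bs4 import BeautifulSoup

BASE_URL = "https://www.theguardian.com/us"

# Hypothetical markup that mimics the cases the loop has to handle.
html = """
<main>
  <section id="headlines">
    <div><a href="/world/story-one" aria-label="Story one"></a>
         <time title="Thursday, 25 January 2024">1h ago</time></div>
    <div><a href="/sport/story-two"><span>Kicker</span><span>Story two</span></a></div>
  </section>
  <section><a href="/ignored">This section has no id and is skipped</a></section>
</main>
"""

rows = []
for section in BeautifulSoup(html, "html.parser").find("main").find_all("section"):
    sect_name = section.get("id")
    if sect_name is None:
        continue  # sections without an id carry no section name
    # href=True keeps only anchors that actually have an href attribute
    for link in section.find_all("a", href=True):
        # prefer the aria-label; otherwise join the visible text fragments
        headline = link.get("aria-label") or ": ".join(link.stripped_strings)
        time_tag = link.parent.find("time")
        published = time_tag.get("title") if time_tag is not None else None
        rows.append({
            "Section Name": sect_name,
            "Headline": headline,
            "Time": published,
            "Link": urljoin(BASE_URL, link["href"]),  # relative path -> absolute URL
        })

# Build the dataframe in one step instead of growing it row by row.
df = pd.DataFrame(rows, columns=["Section Name", "Headline", "Time", "Link"])
print(df)
```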

In [13]:
# 6. observe the final dataframe:
df

| | Section Name | Headline | Time | Link |
|---|---|---|---|---|
| 0 | headlines | Grounded 737 Max 9 planes can return to serv... | Thursday, 25 January 2024 at 00:30 Coordinated... | /business/2024/jan/24/boeing-ceo-plane-safety |
| 1 | headlines | More than one-third of Americans believe Israe... | Wednesday, 24 January 2024 at 22:21 Coordinate... | /us-news/2024/jan/24/americans-believe-israel-... |
| 2 | headlines | United Auto Workers union endorses Joe Biden f... | Wednesday, 24 January 2024 at 19:28 Coordinate... | /us-news/2024/jan/24/united-auto-workers-endor... |
| 3 | headlines | New Hampshire: Primary set turnout record with... | | /us-news/2024/jan/24/new-hampshire-primary-rec... |
| 4 | headlines | No viable route to election: Upbeat Haley vows... | | /us-news/2024/jan/24/nikki-haley-donald-trump-... |
| ... | ... | ... | ... | ... |
| 178 | tabs-popular-0 | Bodies of six people found at remote crossroad... | | /us-news/2024/jan/24/bodies-found-mojave-deser... |
| 179 | tabs-popular-0 | Chair of Arizona Republican party resigns afte... | | /us-news/2024/jan/24/arizona-republican-jeff-d... |
| 180 | tabs-popular-0 | Bank of America sends warning letters to emplo... | | /money/2024/jan/24/bank-of-america-warning-let... |
| 181 | tabs-popular-0 | Lindsey Graham ‘threw Trump under the bus’ in ... | | /books/2024/jan/24/find-me-the-votes-book-grah... |


In [14]:
# 7. write dataframe object to a csv:
df.to_csv('scraped_guardian.csv')
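
By default, `to_csv` also writes the dataframe's index as an unnamed first column, and `pd.read_csv` then loads that column back as `Unnamed: 0`. The sketch below compares the default with `index=False` on a tiny stand-in dataframe. The file name `example.csv` and the sample row are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the scraped dataframe.
df = pd.DataFrame({"Section Name": ["headlines"], "Headline": ["Example headline"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")

    df.to_csv(path)                  # default: the index is written as the first column
    with_index = pd.read_csv(path)

    df.to_csv(path, index=False)     # leave the index out of the file
    without_index = pd.read_csv(path)

print(list(with_index.columns))     # ['Unnamed: 0', 'Section Name', 'Headline']
print(list(without_index.columns))  # ['Section Name', 'Headline']
```

If the row numbers are worth keeping, reading the file with `pd.read_csv(path, index_col=0)` restores them as the index instead.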