<a href="https://colab.research.google.com/github/luiseduardoballarati/MSc-CS-Dissertation/blob/main/The_Guardian_Scraper_API.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The news scraper

This notebook collects articles from the Guardian through its API. The data come from https://open-platform.theguardian.com. The collected information may be used only for academic purposes, and you need an API key (you can request one on the website).

Below are the necessary imports (I used Google Colab notebooks):

In [None]:
import requests
import json
from google.colab import files
import pandas as pd
from bs4 import BeautifulSoup

The function *fetch_guardian_data* takes the API key, the article tag, the date range, the page number and the page size, and returns the API response as JSON. From each result, the script keeps the article type, sectionId, sectionName, publication date, title, URL and content.
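As a minimal sketch of how those parameters map onto the request, the helper below assembles the same query string that the scraping cell builds by hand. The name `build_search_url` and the constant `BASE_URL` are illustrative and not part of the original notebook; the parameter names are the ones used in the request URL further down.

```python
from urllib.parse import urlencode

# Search endpoint used by the notebook.
BASE_URL = 'https://content.guardianapis.com/search'

def build_search_url(api_key, tag, from_date, to_date, page, page_size):
    """Hypothetical helper: assemble the search URL from its parameters."""
    params = {
        'tag': tag,
        'from-date': from_date,
        'to-date': to_date,
        'show-fields': 'bodyText',  # ask for the plain-text article body
        'page': page,
        'page-size': page_size,
        'api-key': api_key,
    }
    # safe='/' leaves the slash in tags such as 'business/business' unescaped,
    # so the result matches the hand-written URL.
    query = urlencode(params, safe='/')
    return f'{BASE_URL}?{query}'

print(build_search_url('YOUR_API_KEY_HERE', 'business/business',
                       '2015-02-23', '2016-02-25', 1, 50))
```

Keeping the parameters in a dictionary makes it easier to see which ones change between requests (only `page` does during a run).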

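The script then converts the results to a pandas DataFrame and prints any error status, the number of articles collected, the time taken and the date range. An Error 400 can mean that the maximum number of articles one can scrape per day was hit, but it also appears to be returned when the loop asks for a page past the last one. In the run below, the 10302 articles collected are not a multiple of the page size of 50, which points to the second case. An error printed at the end of a run therefore does not necessarily mean the scrape failed, because the articles fetched before it are kept. Check the documentation to interpret other status codes: https://open-platform.theguardian.com/documentation/.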
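To make the printed status easier to read, a small helper can attach the standard HTTP reason phrase to the numeric code. This is only a sketch: `describe_status` is a hypothetical name, and the phrases come from Python's standard library rather than from the Guardian documentation, so they give the generic HTTP meaning and not the API-specific cause.

```python
from http import HTTPStatus

def describe_status(status_code):
    """Hypothetical helper: return the code with its standard HTTP reason phrase."""
    try:
        return f'{status_code} {HTTPStatus(status_code).phrase}'
    except ValueError:
        # Raised when the code is not a registered HTTP status.
        return f'{status_code} (non-standard status code)'

for code in (200, 400, 401, 429):
    print(describe_status(code))
```

Inside `fetch_guardian_data`, the error branch could print `describe_status(response.status_code)` instead of the bare number.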

In [None]:
import time

# Start the timer
start_time = time.time()

api_key = 'YOUR_API_KEY_HERE'

tag = 'business/business'
from_date = '2015-02-23'
to_date = '2016-02-25'
page_size = 50  # Number of articles per page

# Function to fetch data from the Guardian API
def fetch_guardian_data(api_key, tag, from_date, to_date, page, page_size):
    url = f'https://content.guardianapis.com/search?tag={tag}&from-date={from_date}&to-date={to_date}&show-fields=bodyText&page={page}&page-size={page_size}&api-key={api_key}'
    # A timeout keeps the loop from hanging forever on a stalled connection
    response = requests.get(url, timeout=30)

    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error: {response.status_code}")
        return None

all_results = []
page = 1

while True:
    data = fetch_guardian_data(api_key, tag, from_date, to_date, page, page_size)
    if data:
        results = data.get('response', {}).get('results', [])
        if not results:
            break
        all_results.extend(results)
        page += 1
    else:
        break

# Extract the desired fields from the collected results
extracted_data = [
    {
        'type': item.get('type'),
        'sectionId': item.get('sectionId'),
        'sectionName': item.get('sectionName'),
        'webPublicationDate': item.get('webPublicationDate'),
        'webTitle': item.get('webTitle'),
        'webUrl': item.get('webUrl'),
        'content': item.get('fields', {}).get('bodyText')
    }
    for item in all_results
]

# Create a DataFrame
df = pd.DataFrame(extracted_data)

# End the timer
end_time = time.time()

# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Time taken to scrape the articles: {elapsed_time} seconds")
print(f"Articles scraped: {df.shape[0]}")
print(f"Dates: from: {df['webPublicationDate'].min()} to {df['webPublicationDate'].max()}")
df.to_csv('business_guardian_articles_11.csv', index=False)

Error: 400
Time taken to scrape the articles: 175.75619220733643 seconds
Articles scraped: 10302
Dates: from: 2015-02-23T00:01:00Z to 2016-02-25T22:04:15Z
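
The output above reports 10302 articles at 50 per page, so the last page was only partly filled and the `Error: 400` followed the next request. The sketch below works out that page count and shows one way the loop could stop before asking for a page that does not exist. It assumes the response envelope carries `currentPage` and `pages` fields, which should be checked against the API documentation; `expected_pages` and `has_more_pages` are hypothetical helpers, not part of the original notebook.

```python
import math

def expected_pages(total_articles, page_size):
    """Number of pages needed to hold total_articles at page_size per page."""
    return math.ceil(total_articles / page_size)

def has_more_pages(data):
    """True if the response says there is at least one page after this one.

    Assumes data['response'] contains 'currentPage' and 'pages'.
    """
    envelope = data.get('response', {})
    return envelope.get('currentPage', 0) < envelope.get('pages', 0)

# Page count implied by the run above: 10302 articles, 50 per page.
print(expected_pages(10302, 50))  # 207

# Illustrative envelopes (made-up values, not real API output).
print(has_more_pages({'response': {'currentPage': 206, 'pages': 207}}))  # True
print(has_more_pages({'response': {'currentPage': 207, 'pages': 207}}))  # False
```

In the `while` loop, checking `has_more_pages(data)` after extending `all_results` and breaking when it is false would end the run without the final failed request.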
