# Al Jazeera Middle East News Scraper

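Scrape up to 2,000 news articles per month from [Al Jazeera Middle East](https://www.aljazeera.com/middle-east/) for 2012 through 2022, using the Firecrawl API. The scraping cell in this notebook currently uses `requests` and BeautifulSoup against the `studies.aljazeera.net` news archive rather than Firecrawl.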
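
The numbers in the description imply a rough upper bound on the size of the crawl. The sketch below works that out, assuming the year range used in the scraping cell (2012 through 2022 inclusive) and its 0.5 second pause per article. The constant names and the `crawl_budget` helper are illustrative and not part of the notebook.

```python
ARTICLES_PER_MONTH = 2000      # monthly cap from the description
MONTHS_PER_YEAR = 12
YEARS = range(2012, 2023)      # 2012 through 2022 inclusive
DELAY_SECONDS = 0.5            # pause per article in the scraping cell


def crawl_budget(per_month=ARTICLES_PER_MONTH, years=YEARS, delay=DELAY_SECONDS):
    """Return (max article count, minimum hours spent sleeping between requests)."""
    max_articles = per_month * MONTHS_PER_YEAR * len(years)
    min_hours = max_articles * delay / 3600
    return max_articles, min_hours


max_articles, min_hours = crawl_budget()
print(f"At most {max_articles:,} articles, at least {min_hours:.1f} hours of delay")
# prints: At most 264,000 articles, at least 36.7 hours of delay
```

The delay figure ignores download time, so a full run would take longer than this.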

In [6]:
# Install Firecrawl and BeautifulSoup if they are not already installed
%pip install firecrawl
%pip install bs4

Note: you may need to restart the kernel to use updated packages.
Collecting bs4
  Using cached bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Collecting beautifulsoup4 (from bs4)
  Using cached beautifulsoup4-4.13.4-py3-none-any.whl.metadata (3.8 kB)
Collecting soupsieve>1.2 (from beautifulsoup4->bs4)
  Using cached soupsieve-2.7-py3-none-any.whl.metadata (4.6 kB)
Using cached bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Using cached beautifulsoup4-4.13.4-py3-none-any.whl (187 kB)
Using cached soupsieve-2.7-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.13.4 bs4-0.0.2 soupsieve-2.7


In [None]:
import os

# Read the key from the environment rather than hard-coding a secret in the notebook.
FIRECRAWL_API_KEY = os.environ.get("FIRECRAWL_API_KEY", "")

## Scraping Logic

- The target is at most 500 articles per week, or roughly 2,000 per month.
- For each month, paginate through the news archive until 2,000 articles are collected or the archive runs out.
- Save each article's title, content, location, date, and URL to a CSV file.
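
The pagination step with a monthly cap can be sketched as follows. This is a minimal illustration, not the notebook's implementation: `collect_month`, `MAX_PER_MONTH`, and the `fetch_page` callable are hypothetical names, and how a real archive exposes page numbers is an assumption. The fetcher is passed in so the loop can be tried without network access.

```python
from typing import Callable, Dict, List

MAX_PER_MONTH = 2000  # cap stated in the list above


def collect_month(fetch_page: Callable[[int], List[Dict[str, str]]],
                  limit: int = MAX_PER_MONTH) -> List[Dict[str, str]]:
    """Page through one month's archive until it runs out or the cap is reached."""
    collected: List[Dict[str, str]] = []
    seen_urls = set()
    page = 0
    while len(collected) < limit:
        batch = fetch_page(page)
        if not batch:               # an empty page means the archive is exhausted
            break
        new_items = 0
        for item in batch:
            if item["url"] in seen_urls:
                continue            # skip links already collected
            seen_urls.add(item["url"])
            collected.append(item)
            new_items += 1
            if len(collected) >= limit:
                break
        if new_items == 0:          # page only repeated old links; stop looping
            break
        page += 1
    return collected


def fake_fetch_page(page: int) -> List[Dict[str, str]]:
    """Stand-in fetcher: three pages of ten placeholder links each."""
    if page >= 3:
        return []
    return [{"url": f"https://example.com/{page}-{i}"} for i in range(10)]


print(len(collect_month(fake_fetch_page, limit=25)))  # prints 25
print(len(collect_month(fake_fetch_page)))            # prints 30
```

In a real run, `fetch_page` would wrap whatever request returns one page of archive links for the month being scraped.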

In [7]:
import csv
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

site_root = "https://studies.aljazeera.net"
base_archive_url = f"{site_root}/en/news/archive/"
years = list(range(2012, 2023))  # 2012 through 2022 inclusive

def fetch_soup(url):
    """Download and parse a page; return None if the request fails."""
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
    except requests.RequestException as exc:
        print(f"Request failed for {url}: {exc}")
        return None
    return BeautifulSoup(resp.text, "html.parser")

def get_article_links(year):
    """Return title, URL, and date for each article card on a year's archive page.

    Only the first archive page per year is read; pagination is not implemented here.
    """
    soup = fetch_soup(f"{base_archive_url}{year}")
    if soup is None:
        return []
    articles = []
    for card in soup.select(".views-row"):
        title_tag = card.select_one("h3 a")
        date_tag = card.select_one(".date")
        if title_tag and date_tag and title_tag.get("href"):
            articles.append({
                "title": title_tag.get_text(strip=True),
                # urljoin handles both relative and absolute hrefs
                "url": urljoin(site_root, title_tag["href"]),
                "date": date_tag.get_text(strip=True),
            })
    return articles

def get_article_content(url):
    """Return (content, location) for one article page."""
    soup = fetch_soup(url)
    if soup is None:
        return "", ""
    content_tag = soup.select_one(".field--name-body")
    content = content_tag.get_text(separator="\n", strip=True) if content_tag else ""
    # Location is not always present and no selector is set for it yet,
    # so the column stays empty until one is added.
    location = ""
    return content, location

with open("aljazeera_studies_2012_2022.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Content", "Location", "Date", "URL"])
    for year in years:
        print(f"Scraping year: {year}")
        for article in get_article_links(year):
            content, location = get_article_content(article["url"])
            writer.writerow([
                article["title"],
                content,
                location,
                article["date"],
                article["url"],
            ])
            time.sleep(0.5)  # be polite to the server

print("Done!")

Scraping year: 2012
Scraping year: 2013
Scraping year: 2014
Scraping year: 2015
Scraping year: 2016
Scraping year: 2017
Scraping year: 2018
Scraping year: 2019
Scraping year: 2020
Scraping year: 2021
Scraping year: 2022
Done!
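
The log above shows each year being visited but not how many rows were written. A small check on the resulting file, sketched below, could confirm that articles were actually saved. The `summarize_csv` helper is hypothetical; it assumes only the column names written by the scraping cell.

```python
import csv
import os

# Article bodies can be long; raise the csv module's default per-field size limit.
csv.field_size_limit(10**9)


def summarize_csv(path):
    """Count rows, rows with empty Content, and repeated URLs in the scraper's CSV."""
    total = 0
    empty_content = 0
    duplicate_urls = 0
    seen = set()
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            if not row["Content"].strip():
                empty_content += 1
            if row["URL"] in seen:
                duplicate_urls += 1
            seen.add(row["URL"])
    return {"rows": total, "empty_content": empty_content, "duplicate_urls": duplicate_urls}


csv_path = "aljazeera_studies_2012_2022.csv"  # file name used by the scraping cell
if os.path.exists(csv_path):
    print(summarize_csv(csv_path))
```

A row count of zero, or a high share of empty `Content` values, would suggest that the CSS selectors no longer match the archive pages.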
