Author: Jedidah Wavinya

---

Date: 23rd October 2025


---


* With so many news channels appearing, it is increasingly difficult to keep track of the news that matters worldwide.
* We all have a favorite news channel, but no single channel covers everything.

* This web scraping project builds a customized one-stop source of relevant news from around the world.

1. Colab setup & installs

In [1]:
# Install libraries
!pip install newspaper3k feedparser beautifulsoup4 requests lxml python-dateutil tqdm

# newspaper3k requires punkt tokenizer
import nltk
nltk.download('punkt')

Collecting newspaper3k
  Downloading newspaper3k-0.2.8-py3-none-any.whl.metadata (11 kB)
Collecting feedparser
  Downloading feedparser-6.0.12-py3-none-any.whl.metadata (2.7 kB)
Collecting cssselect>=0.9.2 (from newspaper3k)
  Downloading cssselect-1.3.0-py3-none-any.whl.metadata (2.6 kB)
Collecting tldextract>=2.0.1 (from newspaper3k)
  Downloading tldextract-5.3.0-py3-none-any.whl.metadata (11 kB)
Collecting feedfinder2>=0.0.4 (from newspaper3k)
  Downloading feedfinder2-0.0.4.tar.gz (3.3 kB)
  Preparing metadata (setup.py) ... done
Collecting jieba3k>=0.35.1 (from newspaper3k)
  Downloading jieba3k-0.35.1.zip (7.4 MB)
     7.4/7.4 MB 36.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting tinysegmenter==0.3 (from newspaper3k)
  Downloading tinysegmenter-0.3.tar.gz (16 kB)
  Preparing metadata (setup.py) ... done
Collecting sgmllib3k (from feedp

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

2. Imports & helper utilities

In [2]:
!pip install lxml_html_clean
import pandas as pd
import numpy as np
import feedparser
import requests
from bs4 import BeautifulSoup
from newspaper import Article
import csv
import time
from urllib.parse import urlparse
from datetime import datetime
from dateutil import parser as dateparser
from tqdm import tqdm
import os

Collecting lxml_html_clean
  Downloading lxml_html_clean-0.4.3-py3-none-any.whl.metadata (2.3 kB)
Downloading lxml_html_clean-0.4.3-py3-none-any.whl (14 kB)
Installing collected packages: lxml_html_clean
Successfully installed lxml_html_clean-0.4.3


3. Politeness: robots.txt check + user-agent + rate limiting

In [3]:
# Simple user agent
HEADERS = {"User-Agent": "Wavinya-News-Aggregator/1.0 (+https://example.com)"}

# Check robots.txt for disallow rules (very simple check)
def is_allowed(url, user_agent="*"):
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    try:
        r = requests.get(robots_url, headers=HEADERS, timeout=8)
        if r.status_code != 200:
            return True  # no robots.txt -> assume allowed (still be polite)
        rules = r.text.splitlines()
        ua = None
        disallowed = []
        for line in rules:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            if line.lower().startswith('user-agent'):
                ua = line.split(':',1)[1].strip()
            elif line.lower().startswith('disallow') and ua in (user_agent, '*'):
                path = line.split(':',1)[1].strip()
                disallowed.append(path)
        # naive check: if any disallowed path is prefix of url.path -> block
        for d in disallowed:
            if d and parsed.path.startswith(d):
                return False
        return True
    except Exception:
        # robots.txt unreachable or unparseable: fail open (allow), and rely on rate limiting to stay polite
        return True

# Example usage
print(is_allowed("https://www.bbc.com/news"))


True
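
The hand-rolled check above treats every `User-agent` line as its own group and ignores `Allow` rules. As an alternative, here is a minimal sketch using the standard library's `urllib.robotparser`, which handles both. The robots.txt text in `SAMPLE_ROBOTS` and the helper name `build_robot_parser` are made up for illustration; in the notebook, the text would come from the same `requests.get(robots_url, ...)` call used in `is_allowed`.

```python
from urllib.robotparser import RobotFileParser

def build_robot_parser(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = build_robot_parser(SAMPLE_ROBOTS)
agent = "Wavinya-News-Aggregator/1.0"
# Rules are matched in file order; the first rule whose path prefix matches wins.
print(rp.can_fetch(agent, "https://example.com/news"))
print(rp.can_fetch(agent, "https://example.com/private/page"))
```

Parsing once per host and reusing the parser would also avoid downloading robots.txt again for every article URL.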


4. RSS-first approach (recommended)
* Use RSS feeds where possible: they are quicker to fetch than full pages and are designed for machine consumption.

In [4]:
# Example RSS feeds: you can add/remove as desired
RSS_FEEDS = {
    "BBC": "http://feeds.bbci.co.uk/news/rss.xml",
    "Reuters": "http://feeds.reuters.com/reuters/topNews",  # yielded 0 entries in the run in section 6
    "TheGuardian": "https://www.theguardian.com/world/rss",
    "AlJazeera": "https://www.aljazeera.com/xml/rss/all.xml",
    "CNN": "http://rss.cnn.com/rss/edition.rss"
}

def fetch_from_rss(feed_url, max_items=15):
    parsed = feedparser.parse(feed_url)
    items = []
    for entry in parsed.entries[:max_items]:
        item = {
            "title": entry.get("title"),
            "link": entry.get("link"),
            "published": entry.get("published", entry.get("pubDate")),
            "summary": entry.get("summary", "")[:1000], # short preview
            "source": parsed.feed.get("title", feed_url)
        }
        items.append(item)
    return items

# Example: fetch BBC top items
items = fetch_from_rss(RSS_FEEDS["BBC"])
print(len(items), items[0])


15 {'title': 'Met Police officers sacked for gross misconduct after BBC Panorama investigation', 'link': 'https://www.bbc.com/news/articles/cy0kynx59v0o?at_medium=RSS&at_campaign=rss', 'published': 'Thu, 23 Oct 2025 15:50:08 GMT', 'summary': 'PC Philip Neilson and PC Martin Borg were dismissed following accelerated misconduct hearings after the BBC investigation.', 'source': 'BBC News'}
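
The `published` value above is an RFC 822 style string, while dates recovered from article metadata are ISO 8601, so the final CSV ends up with mixed formats in `publish_date`. Below is a minimal sketch of one way to normalize both to ISO 8601 in UTC using only the standard library. The function name `normalize_date` is hypothetical, and treating timestamps without a timezone as UTC is an assumption, not something the feeds guarantee.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def normalize_date(value):
    """Return an ISO 8601 UTC timestamp string, or None if unparseable."""
    if not value:
        return None
    try:
        # RFC 822 / RFC 2822 dates, e.g. "Thu, 23 Oct 2025 15:50:08 GMT"
        dt = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        try:
            # ISO 8601 dates, e.g. "2025-10-22T00:00:00"
            dt = datetime.fromisoformat(value)
        except ValueError:
            return None
    if dt.tzinfo is None:
        # Assumption: naive timestamps are treated as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).isoformat()

print(normalize_date("Thu, 23 Oct 2025 15:50:08 GMT"))
print(normalize_date("2025-10-22T00:00:00"))
```

With a single format, the `publish_date` column can later be parsed and sorted with `pd.to_datetime` without per-source special cases.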


5. Extract full article (newspaper3k) + fallback parser

In [5]:
def extract_article(url, timeout=15):
    # Respect robots.txt
    if not is_allowed(url):
        return {"error": "blocked_by_robots", "url": url}
    try:
        art = Article(url)
        art.download()
        art.parse()
        # publish date sometimes missing — try to parse or set None
        publish_date = art.publish_date
        if publish_date is None:
            # try to fetch meta tags
            r = requests.get(url, headers=HEADERS, timeout=timeout)
            soup = BeautifulSoup(r.content, "lxml")
            # common meta tags for publication date
            meta_date = None
            for tag in ['meta[property="article:published_time"]', 'meta[name="date"]', 'meta[name="publication_date"]', 'meta[itemprop="datePublished"]']:
                m = soup.select_one(tag)
                if m and m.get("content"):
                    meta_date = m.get("content")
                    break
            if meta_date:
                try:
                    publish_date = dateparser.parse(meta_date)
                except (ValueError, OverflowError):
                    publish_date = None
        return {
            "title": art.title,
            "authors": art.authors,
            "text": art.text,
            "top_image": art.top_image,
            "publish_date": publish_date.isoformat() if publish_date else None,
            "url": url
        }
    except Exception as e:
        # fallback: try minimal parse with BeautifulSoup to extract paragraphs
        try:
            r = requests.get(url, headers=HEADERS, timeout=timeout)
            soup = BeautifulSoup(r.content, "lxml")
            paragraphs = soup.find_all('p')
            text = "\n\n".join([p.get_text().strip() for p in paragraphs[:40]])
            # attempt to get title
            title = soup.title.string.strip() if soup.title and soup.title.string else ""
            return {"title": title, "authors": [], "text": text, "top_image": None, "publish_date": None, "url": url}
        except Exception as e2:
            return {"error": "failed_to_download", "exception": str(e2), "url": url}


6. Full pipeline: fetch RSS → extract articles → save CSV

In [6]:
OUTPUT_FILE = "news_data.csv"

def run_pipeline(feeds, max_per_feed=10, pause_between_requests=2.0):
    # header for CSV
    fieldnames = ["scrape_time", "source", "title", "url", "publish_date", "authors", "summary", "text", "top_image"]
    with open(OUTPUT_FILE, mode='w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for source, feed in feeds.items():
            print(f"Fetching from RSS: {source}")
            items = fetch_from_rss(feed, max_items=max_per_feed)
            for it in tqdm(items):
                # polite pause
                time.sleep(pause_between_requests)
                # extract full article
                art = extract_article(it["link"])
                row = {
                    "scrape_time": datetime.utcnow().isoformat(),  # naive UTC timestamp; utcnow() is deprecated since Python 3.12
                    "source": source,
                    "title": art.get("title") or it.get("title"),
                    "url": it.get("link"),
                    "publish_date": art.get("publish_date") or it.get("published"),
                    "authors": ",".join(art.get("authors") or []),
                    "summary": it.get("summary") or (art.get("text")[:300] if art.get("text") else ""),
                    "text": art.get("text") or "",
                    "top_image": art.get("top_image") or ""
                }
                writer.writerow(row)
    print("Saved to", OUTPUT_FILE)

# Run (example)
run_pipeline(RSS_FEEDS, max_per_feed=5, pause_between_requests=1.5)


Fetching from RSS: BBC

100%|██████████| 5/5 [00:10<00:00,  2.04s/it]


Fetching from RSS: Reuters


0it [00:00, ?it/s]


Fetching from RSS: TheGuardian


100%|██████████| 5/5 [00:08<00:00,  1.73s/it]


Fetching from RSS: AlJazeera


100%|██████████| 5/5 [00:08<00:00,  1.68s/it]


Fetching from RSS: CNN


100%|██████████| 5/5 [00:11<00:00,  2.22s/it]

Saved to news_data.csv
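
The pipeline above sleeps for a fixed pause before every article, whichever host the request goes to. A per-host throttle is one possible refinement: it waits only when the same host was contacted too recently. This is a minimal sketch, not part of the original pipeline. The class name `HostThrottle` and its parameters are hypothetical, and the `clock` and `sleep` arguments exist so the behaviour can be checked without real waiting.

```python
import time
from urllib.parse import urlparse

class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep only as long as needed; return the number of seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually released.
        self._last_request[host] = now + delay
        return delay
```

In `run_pipeline`, a call such as `throttle.wait(it["link"])` could replace `time.sleep(pause_between_requests)`. Because each feed is processed in turn, most consecutive requests hit the same host, so the gain is mainly when articles from different sources are interleaved.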





7. Save output to Google Drive (optional but recommended)

In [7]:
from google.colab import drive
drive.mount('/content/drive')

# create a folder and copy
os.makedirs('/content/drive/MyDrive/news-aggregator', exist_ok=True)
!cp news_data.csv /content/drive/MyDrive/news-aggregator/news_data_{datetime.utcnow().strftime("%Y%m%d_%H%M")}.csv
print("Copied to Drive")


Mounted at /content/drive




Copied to Drive


8. De-duplication & basic cleanup

In [8]:
import pandas as pd
import os

OUTPUT_FILE = "news_data.csv"
DEDUPED_OUTPUT_FILE = "news_data_deduped.csv"

if not os.path.exists(OUTPUT_FILE):
    print(f"Error: {OUTPUT_FILE} not found. Please run the 'run_pipeline' function first to generate the data.")
else:
    df = pd.read_csv(OUTPUT_FILE)
    # remove exact URL duplicates
    df = df.drop_duplicates(subset=['url'])
    # simple title-based dedupe (case-insensitive, surrounding whitespace ignored)
    df['title_lower'] = df['title'].str.lower().str.strip()
    df = df.drop_duplicates(subset=['title_lower'])
    # drop the temporary lowercase title column
    df = df.drop(columns=['title_lower'])
    df.to_csv(DEDUPED_OUTPUT_FILE, index=False)
    # display top rows
    display(df.head())

Unnamed: 0,scrape_time,source,title,url,publish_date,authors,summary,text,top_image
0,2025-10-23T17:07:59.021013,BBC,Met officers sacked for gross misconduct after...,https://www.bbc.com/news/articles/cy0kynx59v0o...,"Thu, 23 Oct 2025 15:50:08 GMT",,PC Philip Neilson and PC Martin Borg were dism...,Met Police officers sacked after BBC Panorama ...,https://ichef.bbci.co.uk/news/1024/branded_new...
1,2025-10-23T17:08:01.557359,BBC,Bloody Sunday: Soldier F found not guilty of m...,https://www.bbc.com/news/articles/c993nlken18o...,"Thu, 23 Oct 2025 16:09:04 GMT",,Thirteen people were shot dead and at least 15...,Not guilty verdict for Soldier F in Bloody Sun...,https://ichef.bbci.co.uk/news/1024/branded_new...
2,2025-10-23T17:08:03.529781,BBC,PM determined to keep Phillips in job as groom...,https://www.bbc.com/news/articles/cvgwnqeq5z0o...,"Thu, 23 Oct 2025 16:27:30 GMT",,Ministers are also expecting it to be months b...,PM determined to keep Phillips in job as groom...,https://ichef.bbci.co.uk/news/1024/branded_new...
3,2025-10-23T17:08:05.556450,BBC,Tess Daly and Claudia Winkleman to leave Stric...,https://www.bbc.com/news/articles/cz0x1lr7j92o...,"Thu, 23 Oct 2025 16:16:34 GMT",,The pair have presented the show together sinc...,Tess Daly and Claudia Winkleman to leave Stric...,https://ichef.bbci.co.uk/news/1024/branded_new...
4,2025-10-23T17:08:07.360264,BBC,Israel maintaining control deeper inside Gaza ...,https://www.bbc.com/news/articles/cx2y00g4x29o...,"Thu, 23 Oct 2025 13:35:35 GMT",,Israel has placed boundary markers up to 520m ...,New images show Israeli control line deeper in...,https://ichef.bbci.co.uk/news/1024/branded_new...
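
Exact URL matching can miss duplicates when the same article arrives with different tracking parameters, such as the `?at_medium=RSS&at_campaign=rss` suffix on the BBC link in section 4. The sketch below shows one way to build a comparison key by stripping such parameters. The function name `canonical_url` and the prefix list `TRACKING_PREFIXES` are assumptions for illustration and would need adjusting for other feeds.

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Assumed list of tracking-parameter prefixes; extend it for your feeds.
TRACKING_PREFIXES = ("utm_", "at_")

def canonical_url(url):
    """Drop tracking parameters and fragments, and lowercase the host."""
    parts = urlparse(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunparse(parts._replace(
        netloc=parts.netloc.lower(),
        query=urlencode(kept),
        fragment="",
    ))

print(canonical_url(
    "https://www.bbc.com/news/articles/cy0kynx59v0o?at_medium=RSS&at_campaign=rss"
))
```

One way to use it would be `df['url_key'] = df['url'].map(canonical_url)` followed by `df.drop_duplicates(subset=['url_key'])`, keeping the original `url` column untouched.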


9. Quick sentiment flag — very simple

In [9]:
!pip install textblob
from textblob import TextBlob
def sentiment_of_text(text):
    # Skip missing values (pandas reads empty cells as NaN) and very short strings
    if not isinstance(text, str) or len(text.strip()) < 20:
        return None
    # Score only the first 1000 characters to keep this fast
    tb = TextBlob(text[:1000])
    return {"polarity": tb.sentiment.polarity, "subjectivity": tb.sentiment.subjectivity}

# Example use (scores each text once):
# df['sentiment'] = df['text'].apply(lambda t: (sentiment_of_text(t) or {}).get('polarity'))
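
TextBlob reports polarity as a number between -1.0 and 1.0. To turn it into the simple flag this section is named after, the score can be bucketed. This is a minimal sketch: the function name `polarity_flag` and the 0.1 threshold are arbitrary choices for illustration, not values taken from TextBlob or from the original notebook.

```python
def polarity_flag(polarity, threshold=0.1):
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if polarity is None:
        return None  # no score available (missing or very short text)
    if polarity >= threshold:
        return "positive"
    if polarity <= -threshold:
        return "negative"
    return "neutral"

for score in (0.45, -0.3, 0.02, None):
    print(score, "->", polarity_flag(score))
```

Applied to the column from the example above, `df['sentiment'].map(polarity_flag)` would give one label per article. Note that pandas stores missing scores as NaN rather than None, and NaN falls through to "neutral" here, so a `pd.isna` check may be worth adding.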




10. Display content

In [10]:
import pandas as pd
import os

DEDUPED_OUTPUT_FILE = "news_data_deduped.csv"

if os.path.exists(DEDUPED_OUTPUT_FILE):
    deduped_df = pd.read_csv(DEDUPED_OUTPUT_FILE)
    print(f"Displaying the full content of {DEDUPED_OUTPUT_FILE}:")
    display(deduped_df)
else:
    print(f"Error: {DEDUPED_OUTPUT_FILE} not found. Please run the data pipeline and deduplication steps first.")

Displaying the full content of news_data_deduped.csv:


Unnamed: 0,scrape_time,source,title,url,publish_date,authors,summary,text,top_image
0,2025-10-23T17:07:59.021013,BBC,Met officers sacked for gross misconduct after...,https://www.bbc.com/news/articles/cy0kynx59v0o...,"Thu, 23 Oct 2025 15:50:08 GMT",,PC Philip Neilson and PC Martin Borg were dism...,Met Police officers sacked after BBC Panorama ...,https://ichef.bbci.co.uk/news/1024/branded_new...
1,2025-10-23T17:08:01.557359,BBC,Bloody Sunday: Soldier F found not guilty of m...,https://www.bbc.com/news/articles/c993nlken18o...,"Thu, 23 Oct 2025 16:09:04 GMT",,Thirteen people were shot dead and at least 15...,Not guilty verdict for Soldier F in Bloody Sun...,https://ichef.bbci.co.uk/news/1024/branded_new...
2,2025-10-23T17:08:03.529781,BBC,PM determined to keep Phillips in job as groom...,https://www.bbc.com/news/articles/cvgwnqeq5z0o...,"Thu, 23 Oct 2025 16:27:30 GMT",,Ministers are also expecting it to be months b...,PM determined to keep Phillips in job as groom...,https://ichef.bbci.co.uk/news/1024/branded_new...
3,2025-10-23T17:08:05.556450,BBC,Tess Daly and Claudia Winkleman to leave Stric...,https://www.bbc.com/news/articles/cz0x1lr7j92o...,"Thu, 23 Oct 2025 16:16:34 GMT",,The pair have presented the show together sinc...,Tess Daly and Claudia Winkleman to leave Stric...,https://ichef.bbci.co.uk/news/1024/branded_new...
4,2025-10-23T17:08:07.360264,BBC,Israel maintaining control deeper inside Gaza ...,https://www.bbc.com/news/articles/cx2y00g4x29o...,"Thu, 23 Oct 2025 13:35:35 GMT",,Israel has placed boundary markers up to 520m ...,New images show Israeli control line deeper in...,https://ichef.bbci.co.uk/news/1024/branded_new...
5,2025-10-23T17:08:09.368420,TheGuardian,Cuban man deported from US to Eswatini goes on...,https://www.theguardian.com/world/2025/oct/22/...,2025-10-22T00:00:00,,<p>Roberto Mosquera del Peral was sent to Afri...,A Cuban man deported by the Trump administrati...,https://i.guim.co.uk/img/media/b6c36edf0f410ad...
6,2025-10-23T17:08:11.053006,TheGuardian,Anti-malaria funding cuts could lead to ‘deadl...,https://www.theguardian.com/global-development...,2025-10-21T00:00:00,Kat Lay,<p>Expected reduction in contributions by weal...,Slashed contributions from wealthy countries t...,https://i.guim.co.uk/img/media/0a9bf5a9dc9b420...
7,2025-10-23T17:08:12.843850,TheGuardian,Tensions mount as Alassane Ouattara seeks four...,https://www.theguardian.com/world/2025/oct/20/...,2025-10-20T00:00:00,Eromo Egbejule,<p>Protests have been banned and opposition fi...,"“This is worth several more terms,” the Ivoria...",https://i.guim.co.uk/img/media/c986a1325495237...
8,2025-10-23T17:08:14.517973,TheGuardian,Four dead as Kenyan security forces fire on cr...,https://www.theguardian.com/world/2025/oct/16/...,2025-10-16T00:00:00,,<p>Thousands gather in Nairobi to pay respects...,Four people have been killed in Kenya’s capita...,https://i.guim.co.uk/img/media/c7df4bd2af1dff9...
9,2025-10-23T17:08:16.194437,TheGuardian,Agnes Wanjiru’s niece urges Labour to extradit...,https://www.theguardian.com/world/2025/oct/16/...,2025-10-16T00:00:00,Hannah Al-Othman,<p>Esther Njoki says family has seen ‘big chan...,"The niece of Agnes Wanjiru, who was killed in ...",https://i.guim.co.uk/img/media/4d08201e7d7d59c...


11. Download the de-duplicated CSV file

In [11]:
from google.colab import files

DEDUPED_OUTPUT_FILE = "news_data_deduped.csv"

if os.path.exists(DEDUPED_OUTPUT_FILE):
    print(f"Saving {DEDUPED_OUTPUT_FILE} for download.")
    files.download(DEDUPED_OUTPUT_FILE)
else:
    print(f"Error: {DEDUPED_OUTPUT_FILE} not found. Please run the data pipeline and deduplication steps first.")

Saving news_data_deduped.csv for download.