<a href="https://www.kaggle.com/code/mudasarsabir/bbc-news-rss-feed-pipeline?scriptVersionId=238204935" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# RSS Feed Data Pipeline: Extracting, Analyzing, and Storing BBC News 📡📰

Welcome to this project where we explore the powerful world of data pipelines using **RSS Feeds**. In this project, we'll focus on **BBC News**, extracting the latest articles, analyzing the data, and storing it in a structured format for future use.

---

## Project Overview 🔍

This project demonstrates how to:
1. **Extract Data**: Parse BBC News RSS feeds to retrieve the latest articles.
2. **Analyze Data**: Clean, process, and analyze the extracted data for meaningful insights.
3. **Store Data**: Save the processed data in a structured format (CSV) for easy access and future analysis.

---

### Key Features 💡
- **RSS Feed Parsing**: Using the `feedparser` library to extract article details like title, link, publication date, and summary, plus a category derived from each article's URL.
- **Data Processing**: Cleaning and organizing the data for further analysis using Python’s **pandas** library.
- **Storage**: Storing the cleaned data in CSV format, which can be easily shared or loaded into other applications.

---

## Workflow 📈

1. **RSS Feed Parsing**: We start by fetching the RSS feed from BBC News and parsing it into a structured format.
2. **Data Cleaning**: After extracting the data, we handle missing values, remove duplicates, and standardize the data.
3. **Analysis**: Analyze the frequency of article categories, trends in news coverage, and other insights.
4. **Data Storage**: Finally, the clean and structured data is stored in a CSV file for later use.
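
As a rough illustration of the cleaning and analysis steps, the sketch below works on a small hand-made DataFrame (the rows are placeholders, not real feed output). It drops rows without a link, removes duplicate links, parses the `published` strings into timestamps, and counts articles per category. The column names match those used later in this notebook; the date format string is an assumption based on the date style seen in the feed output.

```python
import pandas as pd

# Placeholder rows that mimic the columns built later in this notebook
sample = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "link": [
        "https://example.com/news/1",
        "https://example.com/news/1",   # duplicate of the first row
        "https://example.com/sport/2",
        None,                           # missing link
    ],
    "published": ["Tue, 06 May 2025 16:25:43 GMT"] * 4,
    "category": ["news", "news", "sport", "news"],
})

# Handle missing values: a row without a link cannot be deduplicated by link
cleaned = sample.dropna(subset=["link"])

# Remove duplicates: keep the first row seen for each link
cleaned = cleaned.drop_duplicates(subset="link").reset_index(drop=True)

# Standardize: turn the date strings into timezone-aware timestamps
cleaned["published"] = pd.to_datetime(
    cleaned["published"],
    format="%a, %d %b %Y %H:%M:%S GMT",  # assumed layout of the feed dates
    utc=True,
    errors="coerce",  # unparseable dates become NaT instead of raising
)

# Analyze: number of articles per category
category_counts = cleaned["category"].value_counts()
print(category_counts)
```

Using `errors="coerce"` is a design choice for this sketch: a single malformed date then shows up as a missing value rather than stopping the whole run.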

---

### Tools and Libraries 🛠️
- **`feedparser`**: For parsing the RSS feed.
- **`pandas`**: For data manipulation and cleaning.
- **CSV Files**: For storing the cleaned data.

---

## Conclusion 📊

By the end of this project, you will have a fully functional RSS feed data pipeline that extracts, analyzes, and stores BBC News articles, which can serve as the foundation for further data analysis, machine learning, or visualization.

Feel free to dive into the individual steps below to learn more about each part of the process!

---

> **Pro Tip**: You can extend this project to other news sources by modifying the RSS feed URL, and even build more sophisticated analysis with additional libraries like **matplotlib** or **seaborn** for data visualization. 💡


# Step 0: Install the `feedparser` Library 📦

Before we can begin parsing RSS feeds from sources like BBC News, we need to install the necessary Python library: **`feedparser`**. This library makes it easy to fetch and parse RSS feed data.

---

### Why Install `feedparser`? 🤔

- 🧩 **Simple Parsing**: `feedparser` provides a simple API to work with RSS and Atom feeds.
- ⚡ **Fast**: It allows you to quickly extract information like titles, links, and publication dates.
- 📚 **Widely Used**: `feedparser` is a reliable and widely adopted library for handling feed data in Python.

---

### Code: Installing `feedparser` with pip 🧪

In a Jupyter or Kaggle notebook, run the following cell to install `feedparser`. The leading `!` sends the command to the shell; in a regular terminal, run `pip install feedparser` without it.


In [1]:
# Install the feedparser library which is used for parsing RSS and Atom feeds
# The library makes it easy to access and process web feeds in Python

!pip install feedparser

Collecting feedparser
  Downloading feedparser-6.0.11-py3-none-any.whl.metadata (2.4 kB)
Collecting sgmllib3k (from feedparser)
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... done
Downloading feedparser-6.0.11-py3-none-any.whl (81 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81.3/81.3 kB 2.7 MB/s eta 0:00:00
Building wheels for collected packages: sgmllib3k
  Building wheel for sgmllib3k (setup.py) ... done
  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6047 sha256=44e2d14c8b452cb2dd02accf581b93c614e144592b96bda9d477dc07c04aefe4
  Stored in directory: /root/.cache/pip/wheels/3b/25/2a/105d6a15df6914f4d15047691c6c28f9052cc1173e40285d03
Successfully built sgmllib3k
Installing collected packages: sgmllib3k, feedparser
Successfully installed feedparser-6.0.11 sgmllib3k-1.0.0


# Step 1: Importing Libraries for Data Manipulation and RSS Feed Parsing 🧰

In this first step, we begin by importing the essential libraries that will allow us to manipulate data and parse the RSS feed efficiently. These libraries will help us handle the RSS feed and process the extracted data in a structured format.

### Libraries 📚

- **`pandas`**: A powerful library for **data manipulation** and storing the extracted information in a structured format (such as DataFrames), which is ideal for handling large datasets.
- **`feedparser`**: A library designed for **parsing RSS feeds**, enabling easy extraction of data like article titles, links, and publication dates.
- **`urllib.parse`**: Part of the standard library. Its `urlparse` function splits a URL into components (scheme, host, path, and so on); we use it to read the section name from the path of each article link.
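
To make the role of `urlparse` concrete, here is a minimal sketch that splits one of the article links from this notebook's sample output. The variable names are illustrative only.

```python
from urllib.parse import urlparse

# A link taken from the sample output shown in this notebook
link = "https://www.bbc.com/news/articles/c5y6y90e5vzo"

parts = urlparse(link)
print(parts.netloc)  # host name
print(parts.path)    # path after the host

# First path segment, used in this notebook as a rough category label
segments = parts.path.strip("/").split("/")
section = segments[0] or None  # an empty path yields "", which becomes None
print(section)
```

Note that `"".split("/")` returns `[""]` rather than an empty list, which is why the sketch uses `or None` instead of testing whether the list is empty.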

---

### Code: Importing Libraries 🖥️


In [2]:
# Import libraries for data manipulation and RSS feed parsing
import pandas as pd
import feedparser
from urllib.parse import urlparse

# Step 2: RSS Feed Parsing and Data Extraction from BBC News 📰

Welcome to the **RSS Feed Parsing** and **Data Extraction** tutorial! In this notebook, we’ll walk through the process of parsing an RSS feed from **BBC News**, extracting valuable data such as titles, links, and publication dates, and saving this information into a **structured CSV file** for future analysis or reporting. 🌟

---

## Overview 📊

Our main objectives in this notebook are:
1. **🚀 Parse an RSS feed** using Python's powerful `feedparser` library.
2. **🔍 Extract important data fields** from the feed like titles, links, publication dates, and categories.
3. **💾 Save the extracted data** in a CSV file for easy access and analysis.

---

## Parsing the RSS Feed 🔍

Now, let’s begin by parsing the RSS feed from **BBC News**. Using the `feedparser` library, we will easily fetch and parse the data, giving us structured access to the headlines, publication dates, and more.

### Step-by-step Procedure 📋

1. **Import the necessary libraries** 🔧
2. **Define the URL** of the **BBC News RSS feed** 🌐
3. **Parse the RSS feed** with the `feedparser.parse()` method 📥
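
To show what kind of structure the parser works with, the sketch below reads a hand-written RSS 2.0 fragment using the standard library's `xml.etree.ElementTree`. This is a stand-in for illustration, not the `feedparser` code used in this notebook, and the feed content is made up. `feedparser` exposes the same item fields as `entry.title`, `entry.link`, `entry.published`, and `entry.summary`.

```python
import xml.etree.ElementTree as ET

# Hand-written RSS 2.0 fragment (placeholder values, not real BBC data)
sample_rss = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/news/articles/1</link>
      <pubDate>Tue, 06 May 2025 16:25:43 GMT</pubDate>
      <description>Short summary of the first article.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/sport/articles/2</link>
      <pubDate>Tue, 06 May 2025 13:52:00 GMT</pubDate>
      <description>Short summary of the second article.</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(sample_rss)

# Each <item> element corresponds to one article in the feed
items = [
    {
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        "published": item.findtext("pubDate"),
        "summary": item.findtext("description"),
    }
    for item in root.iter("item")
]

for row in items:
    print(row["title"], "|", row["published"])
```

In practice `feedparser` is the more robust choice, since it also handles Atom feeds and malformed XML that a strict XML parser would reject.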


In [3]:
# Step 1: Parse the RSS feed
rss_url = "https://feeds.bbci.co.uk/news/rss.xml"
feed = feedparser.parse(rss_url)

# Step 2: Extract relevant data
data = []
for entry in feed.entries:
    # .get() returns None when a feed entry does not provide a field
    link = entry.get("link")

    # Extract section from URL (e.g., 'sport', 'news', etc.).
    # An empty path splits into [""], so fall back to None explicitly.
    path_parts = urlparse(link or "").path.strip("/").split("/")
    url_section = path_parts[0] or None

    data.append({
        "title": entry.get("title"),
        "link": link,
        "published": entry.get("published"),
        "summary": entry.get("summary"),
        "category": url_section  # Extracted from URL
    })

# Step 3: Create a DataFrame
new_data = pd.DataFrame(data)

# Step 4: Save to CSV
new_data.to_csv("new_bbc_news_feed.csv", index=False)

print("Data saved to new_bbc_news_feed.csv")


Data saved to new_bbc_news_feed.csv


# Step 3: Accessing the DataFrame Head 🔑

In this step, we will inspect the **first few rows** of our DataFrame named `new_data`. We'll use the `head()` function to quickly view the top rows of the dataset, which helps us understand its structure before accessing specific columns like **"category"**.

---

### Why Use the `head()` Function? 🤔

The `head()` function in Pandas is an easy way to preview the first few rows of a DataFrame. It's particularly useful when you want to:

- 🧐 **Inspect the Structure**: Get a sense of the columns and the kind of values they hold.
- 🔍 **Verify Data Loading**: Confirm that the dataset has been loaded correctly.
- 📊 **Quick Exploration**: View a snapshot of the data to plan next steps.

**Advantages**:
- 💡 **Quick Preview**: See the top rows instantly without printing the entire DataFrame.
- ⚡ **Efficient**: Ideal for large datasets where viewing the entire DataFrame might not be feasible.
- ✅ **Helps in Debugging**: Quickly check for missing data or inconsistencies.

---

### Code: Accessing the DataFrame Head 📋

To view the first few rows of the `new_data` DataFrame, use the following code:


In [4]:
# Display the first five rows of the DataFrame for inspection
new_data.head()

| | title | link | published | summary | category |
|---|---|---|---|---|---|
| 0 | UK and India agree trade deal after three year... | https://www.bbc.com/news/articles/c5y6y90e5vzo | Tue, 06 May 2025 16:25:43 GMT | The deal will improve access for UK whisky and... | news |
| 1 | Labour MPs' rage over election results simmers on | https://www.bbc.com/news/articles/ckg1rgr25e7o | Tue, 06 May 2025 13:52:00 GMT | Anger over the party's drubbing in last week's... | news |
| 2 | Sycamore Gap accused thought it was 'just a tree' | https://www.bbc.com/news/articles/c4g2g57vjl8o | Tue, 06 May 2025 15:34:53 GMT | Adam Carruthers says he did not fell the tree ... | news |
| 3 | 'No food when I gave birth': Malnutrition rise... | https://www.bbc.com/news/articles/czrv5rl73zdo | Tue, 06 May 2025 16:10:40 GMT | Five-month-old Siwar can barely cry, her voice... | news |
| 4 | King and Queen unveil Coronation portraits | https://www.bbc.com/news/articles/cd020z0dl2eo | Tue, 06 May 2025 14:55:53 GMT | The two portraits will be on display at the ga... | news |


# Step 4: Counting Non-Null Values in the DataFrame 📊

In this step, we will use the `count()` function to determine the number of **non-null values** in each column of our `new_data` DataFrame. This helps us to assess the completeness of our dataset and identify any missing or null values.

---

### Why Use `count()`? 🤔

The `count()` function is a simple and efficient way to get the number of valid (non-null) entries in each column. It allows us to:

- 📊 **Verify Data Integrity**: Ensure that we have no missing or null values in critical columns.
- 🕵️‍♂️ **Identify Missing Data**: Quickly spot columns that may have missing or incomplete data.
- ⚡ **Efficient Check**: Rather than inspecting the entire dataset, `count()` gives us a quick summary of the data.

---

### Code: Using `count()` to Check Non-Null Values 📋

To count the non-null values in each column of the `new_data` DataFrame, use the following code:


In [5]:
# Count the number of non-null values in each column of the new_data DataFrame
# This provides a quick overview of data completeness across all columns
new_data.count()

title        34
link         34
published    34
summary      34
category     34
dtype: int64

# Step 5: Loading and Copying the Dataset 📥

In this step, we load the previously saved feed history (`old_bbc_news_feed.csv`, attached as a Kaggle dataset) into a **Pandas DataFrame** and create a deep copy of it. Working on the copy lets us manipulate the data without altering the DataFrame we loaded.

---

### Why Create a Deep Copy? 🤔

A deep copy creates an independent copy of the DataFrame, ensuring that any changes made to the new DataFrame do not affect the original data. This is useful for maintaining the integrity of the dataset throughout the analysis process.

**Advantages**:
- 🛡️ **Data Preservation**: The original dataset remains intact, which is crucial for reproducibility.
- 🔄 **Safe Modifications**: Any modifications made to the new DataFrame won’t impact the original data.
- 💡 **Versatility**: You can experiment freely with the copied data without worrying about losing the original.
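
The difference between a deep copy and a plain assignment can be seen in a small sketch. The DataFrame below is a one-row placeholder, not the notebook's data.

```python
import pandas as pd

original = pd.DataFrame({"title": ["Old headline"], "category": ["news"]})

# copy() defaults to deep=True, so the data itself is duplicated
working = original.copy()
working.loc[0, "title"] = "Edited headline"

# Plain assignment only creates a second name for the same object
alias = original

print(original.loc[0, "title"])  # unchanged by the edit to `working`
print(working.loc[0, "title"])
print(alias is original)
```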

---

### Code: Loading the Dataset and Creating a Deep Copy 📋

To load the CSV file into a DataFrame and create a deep copy, use the following code:


In [6]:
# Load the dataset from CSV file into a pandas DataFrame
data = pd.read_csv('/kaggle/input/old-bbc-news-feed/old_bbc_news_feed.csv')
# Create a deep copy of the DataFrame to preserve the original data
old_data = data.copy()

# Step 6: Viewing the First Few Rows of the DataFrame 🔍

In this step, we will view the first few rows of the `old_data` DataFrame using the `head()` function. This allows us to quickly inspect the dataset and get a sense of its structure and contents.

---

### Why Use `head()`? 🤔

The `head()` function is a useful tool for:
- 🧐 **Previewing the Data**: It provides a snapshot of the first few rows, helping you understand the dataset's structure without printing the entire DataFrame.
- 🧑‍💻 **Exploratory Data Analysis (EDA)**: It helps you start your analysis by checking the initial records of the dataset.
- ⚡ **Quick Verification**: Verifying that the data was loaded correctly and that it contains the expected columns and values.

By default, `head()` returns the first **5 rows** of the DataFrame, but you can specify the number of rows to display.
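
A short sketch of the row-count argument, using a ten-row placeholder DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"row": range(10)})

first_five = df.head()         # default: first 5 rows
first_three = df.head(3)       # first 3 rows
all_but_last_two = df.head(-2) # negative n: everything except the last 2 rows

print(len(first_five), len(first_three), len(all_but_last_two))
```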

---

### Code: Viewing the First Few Rows of the `old_data` DataFrame 📋

To view the first 5 rows of the `old_data` DataFrame, use the following code:


In [7]:
# Display the first five rows of the old_data DataFrame to inspect its structure and content
old_data.head()

| | title | link | published | summary | category |
|---|---|---|---|---|---|
| 0 | UK and India agree trade deal after three year... | https://www.bbc.com/news/articles/c5y6y90e5vzo | Tue, 06 May 2025 16:25:43 GMT | The deal will improve access for UK whisky and... | news |
| 1 | Labour MPs' rage over election results simmers on | https://www.bbc.com/news/articles/ckg1rgr25e7o | Tue, 06 May 2025 13:52:00 GMT | Anger over the party's drubbing in last week's... | news |
| 2 | Sycamore Gap accused thought it was 'just a tree' | https://www.bbc.com/news/articles/c4g2g57vjl8o | Tue, 06 May 2025 15:34:53 GMT | Adam Carruthers says he did not fell the tree ... | news |
| 3 | 'No food when I gave birth': Malnutrition rise... | https://www.bbc.com/news/articles/czrv5rl73zdo | Tue, 06 May 2025 16:10:40 GMT | Five-month-old Siwar can barely cry, her voice... | news |
| 4 | King and Queen unveil Coronation portraits | https://www.bbc.com/news/articles/cd020z0dl2eo | Tue, 06 May 2025 14:55:53 GMT | The two portraits will be on display at the ga... | news |


# Step 7: Counting Non-Null Values in the DataFrame 📊

In this step, we will use the `count()` function to determine the number of **non-null values** in each column of the `old_data` DataFrame. This helps us assess the completeness of the dataset and identify any missing values.

---

### Why Use `count()`? 🤔

The `count()` function is a simple yet powerful tool for:
- 🕵️‍♂️ **Identifying Missing Data**: It counts only non-null (valid) entries in each column, helping us spot any missing data.
- 📊 **Verifying Data Integrity**: Ensures that each column contains valid, usable data before starting deeper analysis.
- 🔄 **Data Exploration**: Provides insight into how many records are available for each column, especially useful for large datasets.

By default, `count()` returns the number of **non-null values** for each column in the DataFrame.
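
To see how `count()` reacts to missing data, the sketch below uses a small placeholder DataFrame with one missing summary and compares `count()` with `isna().sum()`, which reports the missing values directly.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "summary": ["text", None, "text"],  # one missing value
})

non_null = df.count()      # valid entries per column
missing = df.isna().sum()  # missing entries per column

print(non_null)
print(missing)
```

For every column, the two numbers add up to the number of rows, so a `count()` value below `len(df)` signals missing data in that column.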

---

### Code: Counting Non-Null Values in the `old_data` DataFrame 📋

To count the number of non-null values in each column of `old_data`, use the following code:



In [8]:
# Count the number of non-null values in each column of the old_data DataFrame
# This provides a quick overview of data completeness
old_data.count()

title        34
link         34
published    34
summary      34
category     34
dtype: int64

# Step 8: Combining Old and New Data 📦➕🗃️

In this step, we combine the previously loaded **old data** with the newly extracted **BBC RSS feed data**. This is essential for maintaining a consolidated and up-to-date dataset. After combining, we ensure the dataset remains clean by removing duplicate entries based on the **link** field.

---

### Why Combine Datasets? 🤔

- 🔁 **Incremental Updates**: RSS feeds often provide only the latest entries. Merging with historical data preserves the full record.
- 🧩 **Unified Dataset**: Combining old and new data provides a single, comprehensive DataFrame for analysis.
- 🧼 **Data Integrity**: Removing duplicates prevents repeated entries in visualizations, summaries, or machine learning inputs.
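
The effect of combining and deduplicating can be sketched on two tiny placeholder DataFrames that share one link. The titles and URLs are invented for the example.

```python
import pandas as pd

old = pd.DataFrame({
    "title": ["A", "B"],
    "link": ["https://example.com/a", "https://example.com/b"],
})
new = pd.DataFrame({
    "title": ["B (updated)", "C"],
    "link": ["https://example.com/b", "https://example.com/c"],
})

# Stack the rows and renumber the index from 0
combined = pd.concat([old, new], ignore_index=True)

# keep="first" (the default) keeps the earliest row for each link
deduped = combined.drop_duplicates(subset="link", keep="first")
deduped = deduped.reset_index(drop=True)

print(deduped)
```

Because the old rows come first in the `concat` call, the older version of a repeated article is the one that survives. Passing `keep="last"` would prefer the newer row instead.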

---

### Code: Merge and Deduplicate DataFrames 🧪


In [9]:
# Combine old and new data
combined_data = pd.concat([old_data, new_data], ignore_index=True)

# Remove duplicate articles by link, keeping the first occurrence of each
combined_data = combined_data.drop_duplicates(subset="link")


# Step 9: Previewing the Combined Dataset 🧾🔍

After merging the old and new BBC News data, it’s important to preview the result to ensure that the combination and deduplication steps were successful. We use the `head()` function to view the first few rows of the `combined_data` DataFrame.

---

### Why Preview the Data? 🤔

- 🧠 **Quick Sanity Check**: Verify that the data looks as expected after combining.
- ✅ **Validate Deduplication**: Ensure duplicate articles (e.g., same links) were removed properly.
- 🔍 **Inspect Structure**: Confirm column names and data formats are consistent.
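
Looking at the first rows cannot prove that no duplicates remain further down. A programmatic check is a useful complement; the sketch below shows one way to do it on placeholder data.

```python
import pandas as pd

combined = pd.DataFrame({
    "link": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/b",  # repeated link
        "https://example.com/c",
    ]
})

# Number of rows whose link already appeared earlier in the DataFrame
n_dupes = combined["link"].duplicated().sum()
print("duplicates before:", n_dupes)

deduped = combined.drop_duplicates(subset="link")

# is_unique is True only if every link appears exactly once
print("unique after:", deduped["link"].is_unique)
```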

---

### Code: View the Top Rows of Combined Data 🧪


In [10]:
combined_data.head()

| | title | link | published | summary | category |
|---|---|---|---|---|---|
| 0 | UK and India agree trade deal after three year... | https://www.bbc.com/news/articles/c5y6y90e5vzo | Tue, 06 May 2025 16:25:43 GMT | The deal will improve access for UK whisky and... | news |
| 1 | Labour MPs' rage over election results simmers on | https://www.bbc.com/news/articles/ckg1rgr25e7o | Tue, 06 May 2025 13:52:00 GMT | Anger over the party's drubbing in last week's... | news |
| 2 | Sycamore Gap accused thought it was 'just a tree' | https://www.bbc.com/news/articles/c4g2g57vjl8o | Tue, 06 May 2025 15:34:53 GMT | Adam Carruthers says he did not fell the tree ... | news |
| 3 | 'No food when I gave birth': Malnutrition rise... | https://www.bbc.com/news/articles/czrv5rl73zdo | Tue, 06 May 2025 16:10:40 GMT | Five-month-old Siwar can barely cry, her voice... | news |
| 4 | King and Queen unveil Coronation portraits | https://www.bbc.com/news/articles/cd020z0dl2eo | Tue, 06 May 2025 14:55:53 GMT | The two portraits will be on display at the ga... | news |


# Step 10: Counting Non-Null Values in the Combined Dataset 📊

After merging and cleaning the dataset, it’s crucial to verify that all columns contain the expected number of valid (non-null) values. We use the `count()` function to count the number of **non-null** values in each column of the `combined_data` DataFrame.

---

### Why Use `count()`? 🤔

- 🧑‍💻 **Verify Data Integrity**: Ensure that the dataset has no missing or incomplete values.
- 📈 **Understand the Dataset**: Confirm that each column contains a full set of valid data for analysis.
- 🧹 **Pre-Analysis Cleanliness**: Helps in identifying columns with missing values that may need to be addressed before further analysis.

---

### Code: Count Non-Null Values in the Combined Data 🧪


In [11]:
combined_data.count()

title        34
link         34
published    34
summary      34
category     34
dtype: int64

# Step 11: Saving the Combined Data to a CSV File 💾

Now that we've successfully combined and cleaned the BBC News data, it’s time to save the `combined_data` DataFrame to a CSV file. This allows us to preserve the data for future use, analysis, or sharing.

---

### Why Save to CSV? 🤔

- 📦 **Persistent Storage**: Storing data in a CSV file lets us reuse it later without fetching and reprocessing the feed again.
- 📈 **Interoperability**: CSV files are widely supported by many tools and programming languages, making them easy to share and use in various applications.
- 🔒 **Data Safety**: Saving the dataset ensures that you don't lose the work done up to this point in case the environment is cleared or the data is lost.
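
The sketch below illustrates why `index=False` is passed when saving. It writes a placeholder DataFrame to an in-memory buffer (standing in for a file on disk) with and without the index, then reads both back.

```python
import io
import pandas as pd

df = pd.DataFrame({"title": ["A", "B"], "category": ["news", "sport"]})

# Without the index: the columns round-trip unchanged
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.columns))

# With the index (the default): an extra unnamed column appears on reload
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
restored_with_index = pd.read_csv(with_index)
print(list(restored_with_index.columns))
```

Since this pipeline reloads its own CSV on each run, writing the index would add one more `Unnamed` column every cycle, which is why dropping it is the safer choice here.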

---

### Code: Saving the Combined Data 🧪


In [12]:
# Save the combined DataFrame to a CSV file without the index
combined_data.to_csv('old_bbc_news_feed.csv', index=False)

# Confirm the file has been saved (optional)
print("Combined Data file has been saved as 'old_bbc_news_feed.csv'")


Combined Data file has been saved as 'old_bbc_news_feed.csv'


# 👋 Muhammad Mudasar Sabir

I’m a Machine Learning and Deep Learning enthusiast with a strong interest in Computer Vision and Generative AI. I enjoy solving real-world problems using intelligent, data-driven approaches. My focus areas include:

- 🤖 Machine Learning & Deep Learning  
- 🧠 Computer Vision & Generative Models  
- 📊 Data Analysis & Feature Engineering  
- 🚀 Model Evaluation & Deployment  

---

## 🔗 Connect with Me

- 🧠 **Kaggle**: [https://www.kaggle.com/mudasarsabir](https://www.kaggle.com/mudasarsabir)  
- 💻 **GitHub**: [https://github.com/mudasarsabir](https://github.com/mudasarsabir)  
- 🔗 **LinkedIn**: [https://www.linkedin.com/in/mudasarsabir](https://www.linkedin.com/in/mudasarsabir)  
- 🌐 **Portfolio**: [https://muddasarsabir.netlify.app](https://muddasarsabir.netlify.app)

---

## 📌 Featured Project

### 🎯 Titanic Survival Prediction - Decision Tree
- GitHub: [View Repository](https://github.com/mudasarsabir)  
- Kaggle: [View Notebook](https://www.kaggle.com/mudasarsabir)  

A beginner-friendly ML project applying Decision Tree classification to predict Titanic passenger survival, including EDA, preprocessing, and model evaluation.

---

## 📬 Get in Touch

I'm open to collaboration, research, and AI-focused opportunities. Let’s connect!
