# BBC News Dataset - Data Exploration

This notebook explores the BBC News dataset subset that we'll use for our KG-Enhanced RAG system.

## Objectives:
1. Load and examine the dataset structure
2. Analyze category distribution
3. Explore text length statistics
4. Visualize data characteristics
5. Display sample articles from each category
6. Assess data quality for RAG implementation

---

In [1]:
# Import Required Libraries
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pathlib import Path
import sys
from collections import Counter, defaultdict

# Add src to path for imports
sys.path.append('../src')
from config import config

# Set up plotting style
plt.style.use('default')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

print("📚 Libraries imported successfully!")
print(f"📁 Project root: {config.PROJECT_ROOT}")
print(f"📊 Expected dataset size: {config.DATASET_SIZE} articles")

📚 Libraries imported successfully!
📁 Project root: /Users/jaig/kgrag/notebooks/..
📊 Expected dataset size: 50 articles


In [2]:
# Load the BBC News Dataset
dataset_path = config.get_bbc_subset_path()

print(f"📂 Loading dataset from: {dataset_path}")

try:
    with open(dataset_path, 'r', encoding='utf-8') as f:
        dataset_json = json.load(f)
    
    # Extract metadata and articles
    metadata = dataset_json['dataset_info']
    articles = dataset_json['articles']
    
    print("✅ Dataset loaded successfully!")
    print(f"📊 Total articles: {len(articles)}")
    print(f"📅 Created: {metadata['creation_date']}")
    print(f"🏷️ Categories: {metadata['categories']}")
    
except FileNotFoundError:
    print("❌ Dataset file not found!")
    print("💡 Please run the dataset loader script first:")
    print("   python -m src.ingestion.dataset_loader")
    # Re-raise to keep the traceback; sys.exit() in a notebook only raises SystemExit
    raise
except Exception as e:
    print(f"❌ Error loading dataset: {e}")
    raise

📂 Loading dataset from: /Users/jaig/kgrag/notebooks/../data/raw/bbc_news_subset.json
✅ Dataset loaded successfully!
📊 Total articles: 50
📅 Created: 2025-09-29T16:20:09.324964
🏷️ Categories: ['business', 'entertainment', 'politics', 'sport', 'tech']


In [3]:
# Convert to DataFrame for easier analysis
df = pd.DataFrame(articles)

print("📋 Dataset Overview:")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print("\n🔍 Data Types:")
print(df.dtypes)
print("\n📊 First few rows:")
df.head()

📋 Dataset Overview:
Shape: (50, 6)
Columns: ['original_index', 'category', 'text', 'label', 'label_text', 'article_id']

🔍 Data Types:
original_index     int64
category          object
text              object
label              int64
label_text        object
article_id        object
dtype: object

📊 First few rows:


| | original_index | category | text | label | label_text | article_id |
|---|---|---|---|---|---|---|
| 0 | 284 | business | france telecom gets orange boost strong growth... | 1 | business | bbc_business_00 |
| 1 | 61 | business | fannie mae should restate books us mortgage ... | 1 | business | bbc_business_01 |
| 2 | 643 | business | mexican in us send $16bn home mexican labourer... | 1 | business | bbc_business_02 |
| 3 | 593 | business | christmas sales worst since 1981 uk retail sal... | 1 | business | bbc_business_03 |
| 4 | 535 | business | japan s ageing workforce: built to last in his... | 1 | business | bbc_business_04 |
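
Before relying on these columns downstream, it can help to confirm that every record actually carries them. The following is a minimal sketch, not part of the project: `validate_articles` and `EXPECTED_FIELDS` are hypothetical names introduced here, and the check assumes each article is a plain dict with the six fields listed above and that `article_id` is meant to be unique.

```python
from collections import Counter

# Field names taken from the DataFrame columns shown above.
EXPECTED_FIELDS = {
    "original_index", "category", "text", "label", "label_text", "article_id",
}

def validate_articles(articles):
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    for i, article in enumerate(articles):
        missing = EXPECTED_FIELDS - set(article)
        if missing:
            problems.append(f"record {i}: missing fields {sorted(missing)}")
        elif not str(article["text"]).strip():
            problems.append(f"record {i}: empty text")

    # Assumption: article_id should identify exactly one record.
    id_counts = Counter(a.get("article_id") for a in articles)
    for article_id, count in id_counts.items():
        if count > 1:
            problems.append(f"article_id {article_id!r} appears {count} times")
    return problems
```

Calling `validate_articles(articles)` on the list loaded in cell 2 would return an empty list if every record is complete.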


## Category Distribution Analysis

Let's examine how articles are distributed across different BBC News categories.

In [4]:
# Category Distribution Analysis
category_counts = df['category'].value_counts().sort_index()

print("📊 Category Distribution:")
for category, count in category_counts.items():
    print(f"  {category.capitalize()}: {count} articles")

# Create interactive bar chart
fig = px.bar(
    x=category_counts.index, 
    y=category_counts.values,
    title="BBC News Articles Distribution by Category",
    labels={'x': 'Category', 'y': 'Number of Articles'},
    color=category_counts.values,
    color_continuous_scale='viridis'
)

fig.update_layout(
    xaxis_title="Category",
    yaxis_title="Number of Articles",
    showlegend=False,
    height=400
)

fig.show()

📊 Category Distribution:
  Business: 10 articles
  Entertainment: 10 articles
  Politics: 10 articles
  Sport: 10 articles
  Tech: 10 articles


## Text Length Analysis

Article length drives the chunking strategy in a RAG system, so we measure it in both characters and words.

In [5]:
# Calculate text statistics
df['text_length_chars'] = df['text'].str.len()
df['text_length_words'] = df['text'].str.split().str.len()

# Calculate statistics by category
stats_by_category = df.groupby('category').agg({
    'text_length_chars': ['mean', 'std', 'min', 'max'],
    'text_length_words': ['mean', 'std', 'min', 'max']
}).round(0)

print("📏 Text Length Statistics by Category:")
print(stats_by_category)

📏 Text Length Statistics by Category:
              text_length_chars                     text_length_words         \
                           mean     std   min   max              mean    std   
category                                                                       
business                 2453.0  1268.0  1325  5355             408.0  219.0   
entertainment            1903.0   568.0  1177  2929             340.0  107.0   
politics                 2662.0   612.0  1705  3632             458.0  111.0   
sport                    2312.0   900.0   789  3414             403.0  164.0   
tech                     2546.0  1056.0  1544  5182             432.0  192.0   

                         
               min  max  
category                 
business       215  912  
entertainment  198  531  
politics       280  638  
sport          132  632  
tech           265  918  


In [6]:
# Overall text statistics
overall_stats = {
    'Characters': {
        'Mean': df['text_length_chars'].mean(),
        'Median': df['text_length_chars'].median(),
        'Std': df['text_length_chars'].std(),
        'Min': df['text_length_chars'].min(),
        'Max': df['text_length_chars'].max()
    },
    'Words': {
        'Mean': df['text_length_words'].mean(),
        'Median': df['text_length_words'].median(),
        'Std': df['text_length_words'].std(),
        'Min': df['text_length_words'].min(),
        'Max': df['text_length_words'].max()
    }
}

print("📈 Overall Text Statistics:")
for metric_type, stats in overall_stats.items():
    print(f"\n{metric_type}:")
    for stat_name, value in stats.items():
        print(f"  {stat_name}: {value:.0f}")

📈 Overall Text Statistics:

Characters:
  Mean: 2375
  Median: 2224
  Std: 920
  Min: 789
  Max: 5355

Words:
  Mean: 408
  Median: 380
  Std: 163
  Min: 132
  Max: 918


In [7]:
# Create text length distribution visualizations
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Character Count Distribution', 'Character Count by Category',
                   'Word Count Distribution', 'Word Count by Category'),
    specs=[[{"secondary_y": False}, {"secondary_y": False}],
           [{"secondary_y": False}, {"secondary_y": False}]]
)

# Character count histogram
fig.add_trace(
    go.Histogram(x=df['text_length_chars'], name='Character Count', nbinsx=20),
    row=1, col=1
)

# Character count box plot by category
for category in df['category'].unique():
    category_data = df[df['category'] == category]['text_length_chars']
    fig.add_trace(
        go.Box(y=category_data, name=category.capitalize(), showlegend=False),
        row=1, col=2
    )

# Word count histogram
fig.add_trace(
    go.Histogram(x=df['text_length_words'], name='Word Count', nbinsx=20),
    row=2, col=1
)

# Word count box plot by category
for category in df['category'].unique():
    category_data = df[df['category'] == category]['text_length_words']
    fig.add_trace(
        go.Box(y=category_data, name=category.capitalize(), showlegend=False),
        row=2, col=2
    )

fig.update_layout(height=800, title_text="Text Length Analysis")
fig.show()

## Sample Articles Display

Let's examine sample articles from each category to understand content quality and diversity.

In [8]:
# Display sample articles from each category
def display_article_sample(category, max_chars=300):
    """Display a sample article from the specified category."""
    category_articles = df[df['category'] == category]
    
    if len(category_articles) == 0:
        print(f"No articles found for category: {category}")
        return
    
    # Get first article from category
    sample = category_articles.iloc[0]
    
    print(f"\n🏷️ Category: {category.upper()}")
    print(f"📄 Article ID: {sample['article_id']}")
    print(f"📊 Length: {sample['text_length_chars']} chars, {sample['text_length_words']} words")
    print(f"📝 Text Preview:")
    
    preview_text = sample['text'][:max_chars]
    if len(sample['text']) > max_chars:
        preview_text += "..."
    
    # Format text for better readability
    formatted_text = preview_text.replace('\n', ' ').replace('  ', ' ')
    print(f"   {formatted_text}")
    print("-" * 80)

print("📄 Sample Articles from Each Category")
print("=" * 80)

for category in sorted(df['category'].unique()):
    display_article_sample(category)

📄 Sample Articles from Each Category

🏷️ Category: BUSINESS
📄 Article ID: bbc_business_00
📊 Length: 1787 chars, 291 words
📝 Text Preview:
   france telecom gets orange boost strong growth in subscriptions to mobile phone network orange has helped boost profits at owner france telecom. orange added more than five million new customers in 2004 leading to a 10% increase in its revenues. increased take-up of broadband telecoms services als...
--------------------------------------------------------------------------------

🏷️ Category: ENTERTAINMENT
📄 Article ID: bbc_entertainment_00
📊 Length: 1445 chars, 246 words
📝 Text Preview:
   snow patrol feted at irish awards snow patrol were the big winners in ireland s top music honours the meteor awards picking up accolades for best irish band and album on thursday. the belfast-born glasgow-based band collected the prizes at the ceremony at dublin s point theatre. westlife won the...
--------------------------------------------------------------

## Data Quality Assessment

Let's assess the quality of our dataset for RAG implementation.

In [9]:
# Data Quality Assessment
print("🔍 Data Quality Assessment")
print("=" * 50)

# Check for missing values
missing_data = df.isnull().sum()
print("❓ Missing Values:")
for col, count in missing_data.items():
    if count > 0:
        print(f"  {col}: {count}")
    else:
        print(f"  {col}: ✅ No missing values")

# Check for duplicate articles
duplicate_texts = df['text'].duplicated().sum()
print(f"\n🔄 Duplicate Articles: {duplicate_texts}")

# Check for very short or very long articles
short_articles = len(df[df['text_length_words'] < 50])
long_articles = len(df[df['text_length_words'] > 1000])

print(f"\n📏 Article Length Distribution:")
print(f"  Very short articles (<50 words): {short_articles}")
print(f"  Very long articles (>1000 words): {long_articles}")
print(f"  Normal length articles: {len(df) - short_articles - long_articles}")

# Check category balance
category_balance = df['category'].value_counts()
is_balanced = len(set(category_balance.values)) == 1
print(f"\n⚖️ Category Balance: {'✅ Perfectly balanced' if is_balanced else '⚠️ Imbalanced'}")

# Chunking analysis based on config
chunk_size = config.CHUNK_SIZE
print(f"\n✂️ Chunking Analysis (chunk size: {chunk_size} chars):")
articles_needing_chunking = len(df[df['text_length_chars'] > chunk_size])
# Rough lower-bound estimate: floor division drops the final partial chunk and ignores overlap
avg_chunks_per_article = df['text_length_chars'].apply(lambda x: max(1, x // chunk_size)).mean()

print(f"  Articles needing chunking: {articles_needing_chunking}/{len(df)} ({articles_needing_chunking/len(df)*100:.1f}%)")
print(f"  Average chunks per article: {avg_chunks_per_article:.1f}")

🔍 Data Quality Assessment
❓ Missing Values:
  original_index: ✅ No missing values
  category: ✅ No missing values
  text: ✅ No missing values
  label: ✅ No missing values
  label_text: ✅ No missing values
  article_id: ✅ No missing values
  text_length_chars: ✅ No missing values
  text_length_words: ✅ No missing values

🔄 Duplicate Articles: 0

📏 Article Length Distribution:
  Very short articles (<50 words): 0
  Very long articles (>1000 words): 0
  Normal length articles: 50

⚖️ Category Balance: ✅ Perfectly balanced

✂️ Chunking Analysis (chunk size: 800 chars):
  Articles needing chunking: 49/50 (98.0%)
  Average chunks per article: 2.4
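
The average above comes from floor division (`x // chunk_size`), which drops any final partial chunk and does not account for overlap. The sketch below shows one alternative estimate. It assumes a fixed-size sliding window in which every chunk after the first advances by `chunk_size - overlap` characters. `estimate_chunks` is a hypothetical helper, and the project's actual chunker may split differently (for example, on sentence boundaries). The defaults mirror the 800-character chunk size and 100-character overlap reported by this notebook.

```python
import math

def estimate_chunks(text_length, chunk_size=800, overlap=100):
    """Estimate the sliding-window chunk count for a text of `text_length` characters."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    if text_length <= chunk_size:
        return 1
    stride = chunk_size - overlap
    # The first chunk covers chunk_size characters; each later chunk adds `stride` new ones.
    return 1 + math.ceil((text_length - chunk_size) / stride)

print(estimate_chunks(789))   # shortest article in this subset -> 1
print(estimate_chunks(5355))  # longest article in this subset -> 8
```

Under these assumptions, an article of the mean length (2375 characters) would yield 4 chunks, noticeably more than the floor-based figure suggests.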


In [10]:
# Create a comprehensive summary visualization
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Articles per Category', 'Text Length Distribution', 
                   'Character vs Word Count', 'Category Word Count Comparison'),
    specs=[[{"type": "bar"}, {"type": "histogram"}],
           [{"type": "scatter"}, {"type": "box"}]]
)

# 1. Articles per category
fig.add_trace(
    go.Bar(x=category_counts.index, y=category_counts.values, name='Article Count'),
    row=1, col=1
)

# 2. Text length distribution
fig.add_trace(
    go.Histogram(x=df['text_length_chars'], name='Character Count Distribution'),
    row=1, col=2
)

# 3. Character vs Word count scatter
fig.add_trace(
    go.Scatter(
        x=df['text_length_words'], 
        y=df['text_length_chars'],
        mode='markers',
        text=df['category'],
        name='Articles',
        marker=dict(color=df['category'].astype('category').cat.codes)
    ),
    row=2, col=1
)

# 4. Word count by category
for category in df['category'].unique():
    category_data = df[df['category'] == category]['text_length_words']
    fig.add_trace(
        go.Box(y=category_data, name=category, showlegend=False),
        row=2, col=2
    )

fig.update_layout(height=800, title_text="BBC News Dataset - Comprehensive Analysis")
fig.update_xaxes(title_text="Category", row=1, col=1)
fig.update_yaxes(title_text="Count", row=1, col=1)
fig.update_xaxes(title_text="Character Count", row=1, col=2)
fig.update_xaxes(title_text="Word Count", row=2, col=1)
fig.update_yaxes(title_text="Character Count", row=2, col=1)
fig.update_xaxes(title_text="Category", row=2, col=2)
fig.update_yaxes(title_text="Word Count", row=2, col=2)

fig.show()

## Key Insights for RAG Implementation

Based on our data exploration, here are the key findings:

In [11]:
# Generate insights for RAG implementation
print("🎯 Key Insights for KG-Enhanced RAG Implementation")
print("=" * 60)

print("\n✅ Dataset Quality:")
print("  • Balanced distribution across all 5 categories")
print("  • No missing or duplicate articles")
print("  • Good text quality and diversity")

print(f"\n📊 Content Characteristics:")
print(f"  • Average article length: {df['text_length_chars'].mean():.0f} characters")
print(f"  • Average word count: {df['text_length_words'].mean():.0f} words")
print(f"  • Length range: {df['text_length_chars'].min():.0f} - {df['text_length_chars'].max():.0f} characters")

chunk_analysis = df['text_length_chars'] > config.CHUNK_SIZE
chunking_needed = chunk_analysis.sum()

print(f"\n✂️ Chunking Strategy:")
print(f"  • Chunk size: {config.CHUNK_SIZE} characters")
print(f"  • Articles needing chunking: {chunking_needed}/{len(df)} ({chunking_needed/len(df)*100:.1f}%)")
print(f"  • Overlap: {config.CHUNK_OVERLAP} characters")

print(f"\n🏷️ Categories for Knowledge Graph:")
categories = df['category'].unique()
print(f"  • {len(categories)} distinct categories: {', '.join(categories)}")
print(f"  • {config.ARTICLES_PER_CATEGORY} articles per category")

print(f"\n💡 Recommendations:")
print("  • Current chunk size is appropriate for most articles")
print("  • Categories provide good diversity for knowledge graph construction")
print("  • Text quality is suitable for embedding generation")
print("  • Dataset size is manageable for development and testing")

print(f"\n🎉 Dataset is ready for KG-Enhanced RAG implementation!")

🎯 Key Insights for KG-Enhanced RAG Implementation

✅ Dataset Quality:
  • Balanced distribution across all 5 categories
  • No missing or duplicate articles
  • Good text quality and diversity

📊 Content Characteristics:
  • Average article length: 2375 characters
  • Average word count: 408 words
  • Length range: 789 - 5355 characters

✂️ Chunking Strategy:
  • Chunk size: 800 characters
  • Articles needing chunking: 49/50 (98.0%)
  • Overlap: 100 characters

🏷️ Categories for Knowledge Graph:
  • 5 distinct categories: business, entertainment, politics, sport, tech
  • 10 articles per category

💡 Recommendations:
  • Current chunk size is appropriate for most articles
  • Categories provide good diversity for knowledge graph construction
  • Text quality is suitable for embedding generation
  • Dataset size is manageable for development and testing

🎉 Dataset is ready for KG-Enhanced RAG implementation!
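
To make the chunk size and overlap settings concrete, here is a minimal sketch of a character-based sliding-window splitter. It is an illustration under stated assumptions, not the project's chunker: `chunk_text` is a hypothetical name, the defaults copy the 800/100 values printed above, and a real pipeline might prefer to split on sentence or paragraph boundaries.

```python
def chunk_text(text, chunk_size=800, overlap=100):
    """Split `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `overlap` characters.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    if not text:
        return []

    stride = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += stride
    return chunks
```

For example, `chunk_text(df.loc[0, 'text'])` would return the windows for the first article. The shared overlap means a sentence cut at one chunk boundary still appears whole in the neighbouring chunk, provided it is shorter than the overlap.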
