# Search for "word" in Dutch Articles Dataset

This notebook searches the Dutch (NOS) news articles dataset for a given word, here `bloedneus`, and displays each match with its surrounding context.

## 1. Import Required Libraries

First, let's import the necessary libraries for loading and searching through the dataset.

In [6]:
import pandas as pd
import numpy as np
import re
from pathlib import Path

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Load the Dataset

Load the Dutch articles dataset from the feather file and examine its structure.

In [7]:
# Load the dataset
dataset_path = Path("data/NOS_NL_articles_2015_mar_2025.feather")
print(f"Loading dataset from: {dataset_path}")

if dataset_path.exists():
    df = pd.read_feather(dataset_path)
    print("Dataset loaded successfully!")
    print(f"Shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")
    
    # Display basic info about the dataset
    print("\nDataset Info:")
    df.info()
    
    # Show first few rows
    print("\nFirst 3 rows:")
    display(df.head(3))
else:
    print(f"Error: Dataset file not found at {dataset_path}")
    print("Please check if the file exists in the correct location.")

Loading dataset from: data\NOS_NL_articles_2015_mar_2025.feather
Dataset loaded successfully!
Shape: (295259, 11)
Columns: ['channel', 'url', 'type', 'title', 'keywords', 'section', 'description', 'published_time', 'modified_time', 'image', 'content']

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
Index: 295259 entries, 1948 to 1932
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   channel         295259 non-null  object        
 1   url             295259 non-null  object        
 2   type            295259 non-null  object        
 3   title           295259 non-null  object        
 4   keywords        279786 non-null  object        
 5   section         289735 non-null  object        
 6   description     295259 non-null  object        
 7   published_time  295259 non-null  datetime64[ns]
 8   modified_time   295259 non-null  object        
 9   image           295158 non-null  objec

|   | channel | url | type | title | keywords | section | description | published_time | modified_time | image | content |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1948 | nos | https://nos.nl/artikel/2011341-euro-nu-ook-in-... | article | Euro nu ook in Litouwen | eurozone | Economie | Vanaf vandaag betalen ze in Litouwen met de eu... | 2015-01-01 00:32:52 | 2015-01-01 00:32:52 | https://cdn.nos.nl/image/2015/01/01/48809/1200... | `<h1>Euro nu ook in Litouwen</h1><p>In Litouwen...` |
| 1949 | nos | https://nos.nl/artikel/2011343-start-2015-vol-... | article | Start 2015 vol vreugde maar ook met gewonden e... | oud en nieuw | Binnenland | Nederland is met oliebollen en vuurwerk het ni... | 2015-01-01 01:05:57 | 2015-01-01 07:18:23 | https://cdn.nos.nl/image/2015/01/01/48853/1200... | `<h1>Start 2015 vol vreugde maar ook met gewond...` |
| 1950 | nos | https://nos.nl/artikel/2011346-letland-nieuwe-... | article | Letland nieuwe voorzitter van de Europese Unie | EU-voorzitter, Italië, EU, Letland | Buitenland | Vanaf vandaag neemt Letland het stokje over va... | 2015-01-01 02:32:34 | 2015-01-01 02:32:34 | https://cdn.nos.nl/image/2015/01/01/48818/1200... | `<h1>Letland nieuwe voorzitter van de Europese ...` |


## 3. Search for the Specific Word

Now let's search for the word in the dataset across all text columns.

In [11]:
# Word to search for
search_word = "bloedneus"
print(f"Searching for the word: '{search_word}'")

# Initialize results
matches = []
total_count = 0

# Search all string-typed (object) columns; besides text fields such as 'title' and
# 'content', this also covers 'url', 'image' and 'modified_time'
text_columns = [col for col in df.columns if df[col].dtype == 'object']
print(f"Searching in columns: {text_columns}")

for col in text_columns:
    # Convert to string and handle NaN values
    col_data = df[col].fillna('').astype(str)
    
    # Case-insensitive literal search (regex=False so the term is not treated as a regular expression)
    mask = col_data.str.contains(search_word, case=False, na=False, regex=False)
    
    if mask.any():
        matches_in_col = df[mask].copy()
        matches_in_col['found_in_column'] = col
        matches.append(matches_in_col)
        
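        # Rows with a match in this column; a row that matches in several
        # columns is counted once per column in total_count
        count_in_col = mask.sum()
        total_count += count_in_col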
        print(f"Found {count_in_col} matches in column '{col}'")

print(f"\nTotal matches found: {total_count}")

Searching for the word: 'bloedneus'
Searching in columns: ['channel', 'url', 'type', 'title', 'keywords', 'section', 'description', 'modified_time', 'image', 'content']
Found 5 matches in column 'url'
Found 5 matches in column 'title'
Found 2 matches in column 'description'
Found 65 matches in column 'content'

Total matches found: 77
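
The total above adds up per-column hits, so an article that contains the word in more than one column (for example in both its title and its content) may be counted more than once. A minimal sketch of counting each row only once follows. It uses a small made-up DataFrame (`toy`) in place of the real dataset, and the names `per_column`, `column_hits`, and `unique_rows` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the articles DataFrame (hypothetical rows, not from the dataset)
toy = pd.DataFrame({
    "title": ["Bloedneus na botsing", "Euro nu ook in Litouwen", "Geen treffer"],
    "content": [
        "<p>Hij hield er een bloedneus aan over.</p>",
        "<p>Niets.</p>",
        "<p>Een bloedneus.</p>",
    ],
})
search_word = "bloedneus"
text_columns = ["title", "content"]

# One boolean column per text column: True where the row contains the word.
# regex=False treats the search term as a literal string.
per_column = pd.DataFrame({
    col: toy[col].fillna("").astype(str).str.contains(search_word, case=False, regex=False)
    for col in text_columns
})

column_hits = int(per_column.sum().sum())        # a row counts once per matching column
unique_rows = int(per_column.any(axis=1).sum())  # a row counts at most once

print(f"Column-level matches: {column_hits}")
print(f"Unique rows with a match: {unique_rows}")
```

In the toy data the first row matches in both columns, so the column-level count is 3 while only 2 distinct rows match.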


## 4. Display Search Results

Show the entries where the word was found, including context and relevant metadata.

In [14]:
if matches:
    # Combine all matches
    all_matches = pd.concat(matches, ignore_index=True)
    
    # Sort by publication time, most recent first
    all_matches = all_matches.sort_values('published_time', ascending=False)
    
    print(f"Found {len(all_matches)} entries containing '{search_word}' (sorted by year, newest first):")
    print("=" * 60)
    
    for idx, (_, row) in enumerate(all_matches.iterrows()):
        print(f"\n--- Match {idx + 1} ---")
        print(f"Found in column: {row['found_in_column']}")
        print(f"Published: {row['published_time']}")
        print(f"Year: {row['published_time'].year}")
        
        # Display available metadata columns
        for col in df.columns:
            if col != row['found_in_column'] and col != 'published_time' and not pd.isna(row[col]) and str(row[col]).strip():
                print(f"{col}: {str(row[col])[:200]}{'...' if len(str(row[col])) > 200 else ''}")
        
        # Show the context around the word in the column where it was found
        text_content = str(row[row['found_in_column']])
        if search_word.lower() in text_content.lower():
            # Find the position of the word and show context
            word_pos = text_content.lower().find(search_word.lower())
            context_start = max(0, word_pos - 100)
            context_end = min(len(text_content), word_pos + len(search_word) + 100)
            context = text_content[context_start:context_end]
            
            print(f"\nContext in '{row['found_in_column']}':")
            print(f"...{context}...")
        
        print("\n" + "=" * 60)
else:
    print(f"No matches found for the word '{search_word}' in the dataset.")

Found 77 entries containing 'bloedneus' (sorted by year, newest first):

--- Match 1 ---
Found in column: content
Published: 2025-01-30 18:08:34
Year: 2025
channel: nos
url: https://nos.nl/l/2553883
type: liveblog
title: Drie Nederlandse clubs naar tussenronde Europa League • AZ ontloopt met 4-3 Ajax als opponent
keywords: FC Twente, Ajax, AZ, europa league
description: In dit liveblog volgden we alles rondom de ontknoping van de competitiefase in de Europa League. 
modified_time: 2025-01-31 00:45:56
image: https://cdn.nos.nl/image/2025/01/30/1185870/1024x576a.jpg

Context in 'content':
...maar krijgt de bal hard op zijn neus en via de grond gaat de bal over. Van Wolfswinkel houdt er een bloedneus aan over.</p><p>FC Twente - Besiktas: 0-0</p><h2>AJA-GAL | 23' GOAL! Het is 1-0 voor Ajax</h2><p>GO...


--- Match 2 ---
Found in column: content
Published: 2024-11-20 18:19:49
Year: 2024
channel: nos
url: https://nos.nl/l/2545266
type: liveblog
title: Dapper Twente kan Real Madrid net geen p
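
The display loop above shows context only for the first occurrence in each row, and the `content` column still carries HTML tags. As a rough sketch, a helper like the hypothetical `keyword_contexts` below could return a plain-text snippet for every occurrence. The function name, the simple tag-stripping regex, and the sample string are assumptions for illustration; a real HTML parser would be more robust than the regex used here.

```python
import re

def keyword_contexts(text, word, width=100):
    """Return one context snippet per occurrence of `word` in `text`."""
    # Replace HTML tags with spaces, then collapse runs of whitespace
    plain = re.sub(r"<[^>]+>", " ", text)
    plain = re.sub(r"\s+", " ", plain).strip()

    snippets = []
    # re.escape keeps the search term literal; IGNORECASE mirrors the notebook's search
    for m in re.finditer(re.escape(word), plain, flags=re.IGNORECASE):
        start = max(0, m.start() - width)
        end = min(len(plain), m.end() + width)
        snippets.append(plain[start:end])
    return snippets

# Made-up sample text in the style of the `content` column
sample = (
    "<p>Hij krijgt de bal op zijn neus.</p>"
    "<p>Hij houdt er een bloedneus aan over.</p>"
    "<h2>Later weer een Bloedneus</h2>"
)
for snippet in keyword_contexts(sample, "bloedneus", width=20):
    print(f"...{snippet}...")
```
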

In [13]:
# Yearly summary of matches
if matches:
    all_matches = pd.concat(matches, ignore_index=True)
    
    # Extract year from published_time and create summary
    all_matches['year'] = all_matches['published_time'].dt.year
    yearly_summary = all_matches.groupby(['year', 'found_in_column']).size().reset_index(name='count')
    yearly_total = all_matches.groupby('year').size().reset_index(name='total_matches')
    
    print(f"\n{'='*50}")
    print("YEARLY SUMMARY OF MATCHES")
    print(f"{'='*50}")
    
    for year in sorted(yearly_total['year'].unique(), reverse=True):
        year_total = yearly_total[yearly_total['year'] == year]['total_matches'].iloc[0]
        print(f"\n{year}: {year_total} total matches")
        
        # Show breakdown by column for this year
        year_data = yearly_summary[yearly_summary['year'] == year]
        for _, row in year_data.iterrows():
            print(f"  - {row['found_in_column']}: {row['count']} matches")
    
    # Create and display summary DataFrame
    print(f"\n{'='*30}")
    print("Summary Table:")
    display(yearly_total.sort_values('year', ascending=False))


YEARLY SUMMARY OF MATCHES

2025: 1 total matches
  - content: 1 matches

2024: 3 total matches
  - content: 3 matches

2023: 1 total matches
  - content: 1 matches

2022: 7 total matches
  - content: 7 matches

2021: 10 total matches
  - content: 10 matches

2020: 5 total matches
  - content: 3 matches
  - title: 1 matches
  - url: 1 matches

2019: 2 total matches
  - content: 2 matches

2018: 11 total matches
  - content: 11 matches

2017: 5 total matches
  - content: 5 matches

2016: 16 total matches
  - content: 11 matches
  - description: 1 matches
  - title: 2 matches
  - url: 2 matches

2015: 16 total matches
  - content: 11 matches
  - description: 1 matches
  - title: 2 matches
  - url: 2 matches

Summary Table:


|   | year | total_matches |
|---|---|---|
| 10 | 2025 | 1 |
| 9 | 2024 | 3 |
| 8 | 2023 | 1 |
| 7 | 2022 | 7 |
| 6 | 2021 | 10 |
| 5 | 2020 | 5 |
| 4 | 2019 | 2 |
| 3 | 2018 | 11 |
| 2 | 2017 | 5 |
| 1 | 2016 | 16 |
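
The per-year, per-column breakdown printed above can also be laid out as a single table. The following is a sketch using `pd.crosstab` on a small made-up frame (`toy_matches`) that mimics the `published_time` and `found_in_column` columns of `all_matches`; the toy rows and the `by_year` name are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for `all_matches` (hypothetical rows)
toy_matches = pd.DataFrame({
    "published_time": pd.to_datetime(
        ["2015-03-01", "2015-06-10", "2016-01-05", "2016-01-05"]
    ),
    "found_in_column": ["content", "title", "content", "content"],
})

# Rows are years, columns are the columns in which the word was found
year = toy_matches["published_time"].dt.year.rename("year")
by_year = pd.crosstab(year, toy_matches["found_in_column"])

# Add a per-year total and show the newest year first
by_year["total"] = by_year.sum(axis=1)
by_year = by_year.sort_index(ascending=False)

print(by_year)
```

Combinations with no matches appear as 0, which makes gaps between years and columns easier to spot than in the printed list.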


## 5. Count Word Occurrences

Calculate the total number of times the word appears in the dataset, with a breakdown by column.

In [10]:
# Detailed occurrence count
print(f"Detailed word occurrence analysis for '{search_word}':")
print("=" * 50)

occurrence_summary = {}
total_word_count = 0

for col in text_columns:
    col_data = df[col].fillna('').astype(str)
    
    # Count total occurrences (not just rows containing the word)
    word_count_in_col = 0
    for text in col_data:
        word_count_in_col += len(re.findall(re.escape(search_word), text, re.IGNORECASE))
    
    if word_count_in_col > 0:
        occurrence_summary[col] = word_count_in_col
        total_word_count += word_count_in_col
        print(f"Column '{col}': {word_count_in_col} occurrences")

print(f"\nTotal word occurrences across all columns: {total_word_count}")
print(f"Total row matches summed over columns (a row matching in several columns counts more than once): {total_count}")

# Create a summary DataFrame
if occurrence_summary:
    summary_df = pd.DataFrame([
        {'Column': col, 'Occurrences': count, 'Percentage': (count/total_word_count)*100}
        for col, count in occurrence_summary.items()
    ])
    
    print("\nSummary table:")
    display(summary_df)
else:
    print(f"\nThe word '{search_word}' was not found in any text columns of the dataset.")

Detailed word occurrence analysis for 'bloedneus':
Column 'url': 5 occurrences
Column 'title': 5 occurrences
Column 'description': 2 occurrences
Column 'content': 89 occurrences

Total word occurrences across all columns: 101
Total row matches summed over columns (a row matching in several columns counts more than once): 77

Summary table:


|   | Column | Occurrences | Percentage |
|---|---|---|---|
| 0 | url | 5 | 4.950495 |
| 1 | title | 5 | 4.950495 |
| 2 | description | 2 | 1.980198 |
| 3 | content | 89 | 88.118812 |
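
The counts above come from a Python loop over every cell and match the search term as a substring, so compounds that contain it are included. As a sketch under stated assumptions, the same kind of count can be expressed with the vectorized `Series.str.count`, and a `\b`-delimited pattern restricts it to whole words. The toy DataFrame and the names `substring_counts` and `whole_word_counts` are made up for illustration.

```python
import re
import pandas as pd

# Toy stand-in for the articles DataFrame (hypothetical rows)
toy = pd.DataFrame({
    "title": ["Bloedneus na botsing", None],
    "content": ["Een bloedneus, en nog een bloedneus.", "Een bloedneusje"],
})
search_word = "bloedneus"
columns = ["title", "content"]

def count_pattern(frame, cols, pattern):
    """Sum case-insensitive regex matches per column."""
    return {
        col: int(
            frame[col].fillna("").astype(str)
            .str.count(pattern, flags=re.IGNORECASE)
            .sum()
        )
        for col in cols
    }

# Substring match: also counts compounds such as "bloedneusje"
substring_counts = count_pattern(toy, columns, re.escape(search_word))

# Whole-word match: \b requires a word boundary on both sides
whole_word_counts = count_pattern(toy, columns, rf"\b{re.escape(search_word)}\b")

print("Substring:", substring_counts)
print("Whole word:", whole_word_counts)
```

Which variant is appropriate depends on whether inflected or compound forms should count as occurrences of the word.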
