# News Search Engine
#### TF‑IDF News Headline Search (POLITICS, TRAVEL, SPORTS, HOME & LIVING)

This notebook builds a **searchable index of 4,000 news headlines** (1,000 per category) from the **Kaggle News Category Dataset** and implements a simple search engine using **TF‑IDF** + **cosine similarity**.
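
As a minimal sketch of the idea, assuming scikit-learn is installed, the toy example below ranks three made-up strings (not taken from the dataset) against a short query. The names `toy_docs`, `toy_vectorizer`, and `toy_scores` are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up documents standing in for headlines.
toy_docs = [
    "senate passes budget bill",
    "best beaches to visit this summer",
    "team wins championship game",
]

# Learn the vocabulary and IDF weights, and turn each document into a TF-IDF row.
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Project the query into the same vector space and score it against every document.
toy_query = toy_vectorizer.transform(["summer beaches"])
toy_scores = cosine_similarity(toy_query, toy_matrix).ravel()

# The highest-scoring document is the best match.
toy_best = int(toy_scores.argmax())
print(toy_docs[toy_best])
```

Documents that share no terms with the query score exactly zero, which is why only the beach sentence can match here.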


## Import the libraries

In [1]:
# Imports
import os
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [2]:
pd.set_option("display.max_colwidth", 200)
print("Libraries imported.")

Libraries imported.


## Data Preprocessing

In [3]:

# === Configuration ===
# Candidate locations for the dataset file; the first one that exists is used.
# Add your own path to the list if the file lives elsewhere.
CANDIDATE_PATHS = [
    "./News_Category_Dataset_v3.json",
    "./News_Category_Dataset_v2.json",
    "./News_Category_Dataset.json",
    "/mnt/data/News_Category_Dataset_v3.json",
    "/mnt/data/News_Category_Dataset_v2.json",
    "/mnt/data/News_Category_Dataset.json",
]

DATASET_PATH = None
for p in CANDIDATE_PATHS:
    if os.path.exists(p):
        DATASET_PATH = p
        break

print("Detected dataset path:", DATASET_PATH)
if DATASET_PATH is None:
    print("⚠️ Could not find the dataset file. Please set DATASET_PATH to the correct file location.")


Detected dataset path: ./News_Category_Dataset_v3.json


## Load Dataset

In [4]:

# === Load Dataset ===
if DATASET_PATH is None:
    raise FileNotFoundError(
        "Dataset file not found. Please download the Kaggle News Category Dataset "
        "and set DATASET_PATH to the JSON Lines file (e.g., News_Category_Dataset_v3.json)."
    )

# Kaggle News Category Dataset is JSON Lines (one record per line)
# Expected columns include: 'headline', 'category', 'short_description', 'link', 'authors', 'date'
df_raw = pd.read_json(DATASET_PATH, lines=True)
expected_cols = {"headline", "category"}
missing = expected_cols - set(df_raw.columns)
if missing:
    raise ValueError(f"Dataset is missing expected columns: {missing}. Columns present: {list(df_raw.columns)}")

print("Raw dataset shape:", df_raw.shape)
df_raw.head(3)


Raw dataset shape: (209527, 6)


| | link | headline | category | short_description | authors | date |
|---|---|---|---|---|---|---|
| 0 | https://www.huffpost.com/entry/covid-boosters-uptake-us_n_632d719ee4b087fae6feaac9 | Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters | U.S. NEWS | Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall. | Carla K. Johnson, AP | 2022-09-23 |
| 1 | https://www.huffpost.com/entry/american-airlines-passenger-banned-flight-attendant-punch-justice-department_n_632e25d3e4b0e247890329fe | American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video | U.S. NEWS | He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles. | Mary Papenfuss | 2022-09-23 |
| 2 | https://www.huffpost.com/entry/funniest-tweets-cats-dogs-september-17-23_n_632de332e4b0695c1d81dc02 | 23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23) | COMEDY | "Until you have a dog you don't understand what could be eaten." | Elyse Wanshel | 2022-09-23 |
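
To make the JSON Lines format concrete, here is a small sketch that parses two made-up records from an in-memory string with the same `pd.read_json(..., lines=True)` call. The records and the names `sample_jsonl` and `sample_df` are hypothetical and only illustrate the file layout.

```python
import io

import pandas as pd

# Two made-up records in JSON Lines form: one JSON object per line.
sample_jsonl = (
    '{"headline": "Example headline A", "category": "TRAVEL"}\n'
    '{"headline": "Example headline B", "category": "SPORTS"}\n'
)

# lines=True tells pandas to parse each line as a separate record.
sample_df = pd.read_json(io.StringIO(sample_jsonl), lines=True)
print(sample_df.shape)
print(list(sample_df.columns))
```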


## Balance dataset (1000 per category)

In [5]:
# Filter categories and balance to 1000 per category 

TARGET_CATS = ["POLITICS", "TRAVEL", "SPORTS", "HOME & LIVING"]
df = df_raw[df_raw["category"].isin(TARGET_CATS)].copy()

# Keep only headline + category
df = df.loc[:, ["headline", "category"]]

# Ensure we have enough samples per category
counts = df["category"].value_counts()
print("Counts per category before balancing:\n", counts, "\n")

TARGET_PER_CAT = 1000
rng_seed = 42

def take_exact_n(group, n):
    if len(group) < n:
        raise ValueError(
            f"Category '{group.name}' has only {len(group)} rows; need at least {n}. "
            "Please provide a dataset version with enough rows."
        )
    return group.sample(n=n, random_state=rng_seed)

df_balanced = (
    df.groupby("category", group_keys=False)
      .apply(take_exact_n, n=TARGET_PER_CAT)
      .reset_index(drop=True)
)

# Final sanity checks
final_counts = df_balanced["category"].value_counts().sort_index()
assert all(final_counts[c] == TARGET_PER_CAT for c in TARGET_CATS), "Balancing failed to produce 1000 per category."
print("Balanced counts per category:\n", final_counts)

print("Final balanced dataset shape:", df_balanced.shape)
df_balanced.head(10)


Counts per category before balancing:
 category
POLITICS         35602
TRAVEL            9900
SPORTS            5077
HOME & LIVING     4320
Name: count, dtype: int64 

Balanced counts per category:
 category
HOME & LIVING    1000
POLITICS         1000
SPORTS           1000
TRAVEL           1000
Name: count, dtype: int64
Final balanced dataset shape: (4000, 2)


| | headline | category |
|---|---|---|
| 0 | Busiest Shipping Day Of The Year Is Today, Announces US Postal Service | HOME & LIVING |
| 1 | What To Watch On Netflix That’s New This Week (July 7-13) | HOME & LIVING |
| 2 | Repurposing Idea Shows You How To Organize Hair Ties (PHOTOS) | HOME & LIVING |
| 3 | Company Buys $8000 Horse Lamp By Front Design For Lobby (PHOTO) | HOME & LIVING |
| 4 | Renovate for Rent | HOME & LIVING |
| 5 | A Floating Log Cabin That Combines Tiny Home Living And Lake House Luxury (PHOTOS) | HOME & LIVING |
| 6 | Organize Your Life: Use FireFox's MeeTimer To End Procrastination, Boost Productivity | HOME & LIVING |
| 7 | How To Remove Gum From Shoes With Peanut Butter | HOME & LIVING |
| 8 | Homemade Gift Ideas: Neon Paint Splattered Umbrella | HOME & LIVING |
| 9 | Porsha Williams, Kordell Stewart Divorce Reports Have Us Wondering: Who Will Get Their Gorgeous Home? (VIDEO) | HOME & LIVING |
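
An alternative way to balance the categories, sketched here on a made-up DataFrame, is to check group sizes up front and then call `GroupBy.sample`. This avoids `groupby(...).apply`, whose handling of the grouping column is deprecated in recent pandas releases. The names `toy`, `n_per_cat`, and `toy_balanced` are illustrative, and the rows this approach selects will generally differ from those picked by `take_exact_n`, even with the same seed.

```python
import pandas as pd

# Made-up data: category A has 6 rows, category B has 4.
toy = pd.DataFrame({
    "headline": [f"headline {i}" for i in range(10)],
    "category": ["A"] * 6 + ["B"] * 4,
})
n_per_cat = 3

# Fail early if any category cannot supply n_per_cat rows.
sizes = toy.groupby("category").size()
too_small = sizes[sizes < n_per_cat]
if not too_small.empty:
    raise ValueError(f"Not enough rows in: {too_small.to_dict()}")

# Draw exactly n_per_cat rows from each category, reproducibly.
toy_balanced = (
    toy.groupby("category")
       .sample(n=n_per_cat, random_state=42)
       .reset_index(drop=True)
)
print(toy_balanced["category"].value_counts().to_dict())
```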


## Vectorization

In [6]:
# === TF-IDF Vectorization ===
vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words="english",
    ngram_range=(1, 2),
    min_df=2
)

tfidf_matrix = vectorizer.fit_transform(df_balanced["headline"])
print("TF-IDF matrix shape:", tfidf_matrix.shape)

HEADLINES = df_balanced["headline"].tolist()
CATEGORIES = df_balanced["category"].tolist()

index_df = pd.DataFrame({
    "headline": HEADLINES,
    "category": CATEGORIES
})
index_df.head(10)


TF-IDF matrix shape: (4000, 4736)


| | headline | category |
|---|---|---|
| 0 | Busiest Shipping Day Of The Year Is Today, Announces US Postal Service | HOME & LIVING |
| 1 | What To Watch On Netflix That’s New This Week (July 7-13) | HOME & LIVING |
| 2 | Repurposing Idea Shows You How To Organize Hair Ties (PHOTOS) | HOME & LIVING |
| 3 | Company Buys $8000 Horse Lamp By Front Design For Lobby (PHOTO) | HOME & LIVING |
| 4 | Renovate for Rent | HOME & LIVING |
| 5 | A Floating Log Cabin That Combines Tiny Home Living And Lake House Luxury (PHOTOS) | HOME & LIVING |
| 6 | Organize Your Life: Use FireFox's MeeTimer To End Procrastination, Boost Productivity | HOME & LIVING |
| 7 | How To Remove Gum From Shoes With Peanut Butter | HOME & LIVING |
| 8 | Homemade Gift Ideas: Neon Paint Splattered Umbrella | HOME & LIVING |
| 9 | Porsha Williams, Kordell Stewart Divorce Reports Have Us Wondering: Who Will Get Their Gorgeous Home? (VIDEO) | HOME & LIVING |
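
The sketch below applies the same vectorizer settings to four made-up sentences to show what `ngram_range=(1, 2)` and `min_df=2` do to the vocabulary. It also checks that rows come out L2-normalized, which is scikit-learn's default (`norm="l2"`) and the reason a plain dot product can serve as cosine similarity. The names `docs`, `vec`, and `mat` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up sentences; each pair shares some unigrams and one bigram.
docs = [
    "travel tips for summer travel",
    "summer travel deals",
    "election results tonight",
    "election night results",
]

vec = TfidfVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2), min_df=2)
mat = vec.fit_transform(docs)

# min_df=2 keeps only terms that appear in at least two documents,
# so one-off words such as "tips" are dropped while the shared bigram survives.
print(sorted(vec.vocabulary_))

# Every non-empty row has unit Euclidean length.
row_norms = np.sqrt(np.asarray(mat.multiply(mat).sum(axis=1)).ravel())
print(row_norms)

# With unit-length rows, the dot product and cosine similarity coincide.
same = np.allclose(linear_kernel(mat, mat), cosine_similarity(mat, mat))
print(same)
```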


## Search Implementation

In [7]:
# np and pd are already imported in the first cell.

def search_headlines(query: str, top_k: int = 10):
    """Transform the query to TF-IDF, compute cosine similarity, and return a ranked DataFrame."""
    if not isinstance(query, str) or not query.strip():
        raise ValueError("Query must be a non-empty string.")

    # TfidfVectorizer L2-normalizes its output by default, so the dot product
    # computed by linear_kernel is the cosine similarity.
    q_vec = vectorizer.transform([query])
    sims = linear_kernel(q_vec, tfidf_matrix).ravel()  # shape (n_docs,)

    # A non-positive top_k falls back to the default of 10.
    if top_k <= 0:
        top_k = 10
    # argpartition moves the top_k best scores to the front without a full sort;
    # argsort then orders just those candidates by descending similarity.
    top_idx = np.argpartition(-sims, kth=min(top_k, len(sims)-1))[:top_k]
    top_idx = top_idx[np.argsort(-sims[top_idx])]

    # top_idx is already in descending-similarity order, so no further sort is needed.
    results = pd.DataFrame({
        "rank": np.arange(1, len(top_idx)+1),
        "similarity": sims[top_idx],
        "headline": [HEADLINES[i] for i in top_idx],
        "category": [CATEGORIES[i] for i in top_idx],
    })
    return results

print("Search function defined.")


Search function defined.
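
One behavior worth knowing: if none of the query's terms are in the vocabulary, every similarity is zero and `search_headlines` still returns `top_k` rows, all with score 0. The sketch below shows one possible way to drop such non-matches. `rank_nonzero` is a hypothetical helper written for this illustration, not part of the notebook, and it runs on a made-up three-document corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def rank_nonzero(query, fitted_vectorizer, doc_matrix, docs, top_k=3):
    """Return up to top_k (document, score) pairs, skipping zero-similarity documents."""
    sims = linear_kernel(fitted_vectorizer.transform([query]), doc_matrix).ravel()
    order = np.argsort(-sims, kind="stable")[:top_k]
    return [(docs[i], float(sims[i])) for i in order if sims[i] > 0.0]


# Made-up corpus for the illustration.
toy_docs = ["cheap flights to paris", "paris travel guide", "senate vote delayed"]
toy_vec = TfidfVectorizer()
toy_mat = toy_vec.fit_transform(toy_docs)

hits = rank_nonzero("paris", toy_vec, toy_mat, toy_docs)
misses = rank_nonzero("quantum", toy_vec, toy_mat, toy_docs)  # term not in the vocabulary

print([doc for doc, _ in hits])
print(misses)
```

The shorter Paris document ranks first because L2 normalization gives the shared term a larger weight when a document has fewer other terms.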


In [8]:
# === Demo ===
demo_query = "election travel restrictions"
demo_results = search_headlines(demo_query, top_k=10)
demo_results


| | rank | similarity | headline | category |
|---|---|---|---|---|
| 0 | 1 | 0.59473 | Why I Travel | TRAVEL |
| 1 | 2 | 0.411685 | We’re Still, Somehow, A Year Away From The Presidential Election | POLITICS |
| 2 | 3 | 0.386794 | Obama Has Some Issues With How The Media Are Covering The Election | POLITICS |
| 3 | 4 | 0.37419 | 8 Problems You May Encounter Going To Vote In The Election | HOME & LIVING |
| 4 | 5 | 0.349167 | Travel (Or Lying About Travel) Might Be The Key To Dating Success | TRAVEL |
| 5 | 6 | 0.347071 | Shonda Rhimes Says 2016 Election Is Mirroring Her Show 'Scandal' | POLITICS |
| 6 | 7 | 0.328013 | Player In PowerPoint Election Overthrow Plot Reportedly Talked Often With Mark Meadows | POLITICS |
| 7 | 8 | 0.326406 | Crowd Sourcing The Future of Travel | TRAVEL |
| 8 | 9 | 0.320488 | Obama Takes Stand Against Populist Candidate In French Election | POLITICS |
| 9 | 10 | 0.315908 | The DCCC Is Jumping In And The Special Election In Montana Is About To Get A Ton More Attention | POLITICS |



### Try your own queries

Run the next cell and type your search terms when prompted.


In [9]:
# === Interactive query ===
try:
    user_query = input("Enter your search query: ").strip()
    if user_query:
        out = search_headlines(user_query, top_k=10)
        display(out[["rank", "headline", "category", "similarity"]])
    else:
        print("No query entered. Skipping.")
except EOFError:
    print("Interactive input not available. Using a default query instead.")
    out = search_headlines("world cup home decor", top_k=10)
    display(out[["rank", "headline", "category", "similarity"]])


Enter your search query:  country


| | rank | headline | category | similarity |
|---|---|---|---|---|
| 0 | 1 | Do You Need a Country? Here is One! | TRAVEL | 0.695307 |
| 1 | 2 | Nu Yawka Goes Country in Branson (PHOTOS) | TRAVEL | 0.475492 |
| 2 | 3 | The No. 1 Country To Visit In 2018 If You’re On A Budget | TRAVEL | 0.425906 |
| 3 | 4 | The Country Club Republican Strikes Back | POLITICS | 0.415386 |
| 4 | 5 | Children Are Being Housed In Adult Prisons Across The Country. It Has To Stop. | POLITICS | 0.403851 |
| 5 | 6 | Argentina Wine Country: Beyond Malbec! (PHOTOS) | TRAVEL | 0.397775 |
| 6 | 7 | South Koreans Hike Hallasan, The Country's Tallest Peak, In Middle Of Winter | TRAVEL | 0.391953 |
| 7 | 8 | 'The Lego Backpacker' Instagrams The World, One Country At A Time | TRAVEL | 0.390041 |
| 8 | 9 | What Can Your Country Deliver With the Push of a Button? | POLITICS | 0.383115 |
| 9 | 10 | 787 Dreamliner Draws Boeing Logo Across Country With Flight Path (PHOTO) | TRAVEL | 0.343108 |


In [10]:
# === Optional: Save artifacts ===
from pathlib import Path
import joblib

ARTIFACT_DIR = Path("./artifacts")
ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)

balanced_csv = ARTIFACT_DIR / "balanced_headlines_4k.csv"
index_df.to_csv(balanced_csv, index=False)
print(f"Saved balanced dataset to: {balanced_csv.resolve()}")

joblib.dump(vectorizer, ARTIFACT_DIR / "tfidf_vectorizer.joblib")
joblib.dump(tfidf_matrix, ARTIFACT_DIR / "tfidf_matrix.joblib")
joblib.dump(CATEGORIES, ARTIFACT_DIR / "categories_list.joblib")
joblib.dump(HEADLINES, ARTIFACT_DIR / "headlines_list.joblib")
print("Saved TF-IDF artifacts to:", ARTIFACT_DIR.resolve())


Saved balanced dataset to: C:\Users\bbuser\Downloads\artifacts\balanced_headlines_4k.csv
Saved TF-IDF artifacts to: C:\Users\bbuser\Downloads\artifacts
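
Saved artifacts can later be reloaded to answer queries without refitting. The sketch below shows the round trip on a made-up two-headline corpus written to a temporary directory. It reuses the artifact file names from the cell above, but the data and the `toy_*` and `loaded_*` names are illustrative.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up headlines standing in for the balanced dataset.
toy_headlines = ["city marathon draws record crowd", "ten kitchen storage ideas"]
toy_vectorizer = TfidfVectorizer().fit(toy_headlines)
toy_matrix = toy_vectorizer.transform(toy_headlines)

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)

    # Persist the fitted vectorizer, the sparse matrix, and the headline list.
    joblib.dump(toy_vectorizer, tmp_dir / "tfidf_vectorizer.joblib")
    joblib.dump(toy_matrix, tmp_dir / "tfidf_matrix.joblib")
    joblib.dump(toy_headlines, tmp_dir / "headlines_list.joblib")

    # Reload everything, as a separate session would.
    loaded_vectorizer = joblib.load(tmp_dir / "tfidf_vectorizer.joblib")
    loaded_matrix = joblib.load(tmp_dir / "tfidf_matrix.joblib")
    loaded_headlines = joblib.load(tmp_dir / "headlines_list.joblib")

# The reloaded objects are enough to score a query.
scores = linear_kernel(loaded_vectorizer.transform(["marathon"]), loaded_matrix).ravel()
best = loaded_headlines[int(scores.argmax())]
print(best)
```

Note that joblib files are pickle-based, so they should only be loaded from sources you trust, ideally with the same scikit-learn version that produced them.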
