Data Preprocessing
Load dataset and filter categories.
Balance dataset (1000 per category).
Keep only headline and category.

In [11]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [2]:
df = pd.read_json('News_Category_Dataset_v3.json', lines=True)
df.head()

,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22


Keep only the four categories: POLITICS, TRAVEL, SPORTS, HOME & LIVING.

In [5]:
categories = ['POLITICS', 'TRAVEL', 'SPORTS', 'HOME & LIVING']
filter_by_categories = df[df['category'].isin(categories)][['headline', 'category']]
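
Before sampling, it may be worth confirming that every selected category has enough rows, since sampling without replacement raises a `ValueError` when a group is smaller than the requested sample size. The following is a minimal sketch on a toy frame; `toy`, `n_per_category`, and `too_small` are illustrative names and are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the filtered frame (hypothetical data, not from the dataset).
toy = pd.DataFrame({
    'headline': ['a', 'b', 'c', 'd', 'e'],
    'category': ['POLITICS', 'POLITICS', 'POLITICS', 'TRAVEL', 'TRAVEL'],
})

n_per_category = 3  # the notebook samples 1000 per category

# Count rows per category and keep only the categories that are too small.
counts = toy['category'].value_counts()
too_small = counts[counts < n_per_category]
print(too_small.index.tolist())
```

An empty list would mean every category can be sampled at the requested size.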

Balance the dataset by sampling 1000 headlines per category.

In [8]:
balanced = filter_by_categories.groupby('category').apply(lambda x: x.sample(1000, random_state=42)).reset_index(drop=True)
print(balanced['category'].value_counts())
balanced.head(12)

category
HOME & LIVING    1000
POLITICS         1000
SPORTS           1000
TRAVEL           1000
Name: count, dtype: int64




,headline,category
0,"Busiest Shipping Day Of The Year Is Today, Ann...",HOME & LIVING
1,What To Watch On Netflix That’s New This Week ...,HOME & LIVING
2,Repurposing Idea Shows You How To Organize Hai...,HOME & LIVING
3,Company Buys $8000 Horse Lamp By Front Design ...,HOME & LIVING
4,Renovate for Rent,HOME & LIVING
5,A Floating Log Cabin That Combines Tiny Home L...,HOME & LIVING
6,Organize Your Life: Use FireFox's MeeTimer To ...,HOME & LIVING
7,How To Remove Gum From Shoes With Peanut Butter,HOME & LIVING
8,Homemade Gift Ideas: Neon Paint Splattered Umb...,HOME & LIVING
9,"Porsha Williams, Kordell Stewart Divorce Repor...",HOME & LIVING
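
Recent pandas releases may emit a deprecation warning when `groupby(...).apply` operates on the grouping column. One possible alternative is `DataFrameGroupBy.sample`, sketched below on a toy frame. The names `toy` and `balanced_toy` are illustrative, and this variant is not guaranteed to select the same rows as the `apply` version, even with the same `random_state`.

```python
import pandas as pd

# Toy stand-in for the filtered frame (hypothetical data, not from the dataset).
toy = pd.DataFrame({
    'headline': [f'headline {i}' for i in range(12)],
    'category': ['POLITICS'] * 7 + ['TRAVEL'] * 5,
})

# Draw the same number of rows from each category without using apply.
balanced_toy = (
    toy.groupby('category')
       .sample(n=3, random_state=42)
       .reset_index(drop=True)
)
print(balanced_toy['category'].value_counts())
```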


Apply TF-IDF vectorization to the article headlines.
This builds a searchable vector space over the 4000 headlines, with one row per headline and one column per vocabulary term.

In [10]:
vectorizer = TfidfVectorizer(stop_words='english')
x = vectorizer.fit_transform(balanced['headline'])
x.shape

(4000, 8302)
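
To make the shape above easier to interpret, here is a small sketch that fits the same kind of vectorizer on three made-up headlines and inspects the learned vocabulary. The headlines and the names `toy_headlines`, `toy_vectorizer`, `toy_matrix`, and `vocab` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up headlines, used only to illustrate the matrix layout.
toy_headlines = [
    'The best beaches to visit this summer',
    'Senate votes on the new budget',
    'Summer travel tips for the beach',
]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_matrix = toy_vectorizer.fit_transform(toy_headlines)

# Rows correspond to headlines, columns to vocabulary terms.
print(toy_matrix.shape)

# The vocabulary is lowercased, and English stop words such as 'the' are dropped.
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
```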

Search Implementation
Accept user queries.
Transform query into TF-IDF vector.
Compute cosine similarity with all article vectors.
Return top 10 results.

In [12]:
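def search_headlines(query, top_k=10):
    """Return the top_k headlines most similar to the query, best match first."""
    # Project the query into the TF-IDF space fitted on the headlines.
    query_vector = vectorizer.transform([query])
    # Cosine similarity between the query and every headline vector.
    similarities = cosine_similarity(query_vector, x)[0]
    # Indices of the top_k highest scores, in descending order.
    top_idx = np.argsort(similarities)[::-1][:top_k]
    results = []
    for idx in top_idx:
        row = balanced.iloc[idx]
        results.append({
            'headline': row['headline'],
            'category': row['category'],
            'score': similarities[idx]
        })
    return results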
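
Because `np.argsort` always returns `top_k` indices, a query that shares few terms with the corpus can surface headlines whose similarity is 0. The sketch below shows one possible variant that drops those rows. It runs on a made-up corpus, and `search_nonzero`, `toy_headlines`, `toy_vectorizer`, and `toy_matrix` are hypothetical names, not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up corpus, used only to illustrate the filtering step.
toy_headlines = [
    'Summer travel tips for the beach',
    'Senate votes on the new budget',
    'Beach towns worth a summer visit',
]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_matrix = toy_vectorizer.fit_transform(toy_headlines)

def search_nonzero(query, top_k=10):
    """Return (index, score) pairs, best first, skipping zero-similarity rows."""
    query_vector = toy_vectorizer.transform([query])
    scores = cosine_similarity(query_vector, toy_matrix)[0]
    order = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in order if scores[i] > 0]

hits = search_nonzero('summer beach')
print(hits)

# A query with no known terms matches nothing, so the result is empty.
print(search_nonzero('quantum'))
```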

User Experience
Results should include:
Headline text
Category label
Similarity score
Results should be clearly ranked.
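
As an alternative to a print loop, these requirements could also be met by collecting the results into a DataFrame with an explicit rank column. This is only a sketch: `sample_results` holds hypothetical placeholder values in the same shape that `search_headlines` returns, and `ranked` is an illustrative name.

```python
import pandas as pd

# Hypothetical results, shaped like the output of search_headlines.
sample_results = [
    {'headline': 'Example headline A', 'category': 'TRAVEL', 'score': 0.42},
    {'headline': 'Example headline B', 'category': 'SPORTS', 'score': 0.17},
]

ranked = pd.DataFrame(sample_results)
# The list is already sorted best first, so the rank is the 1-based position.
ranked.insert(0, 'rank', list(range(1, len(ranked) + 1)))
ranked['score'] = ranked['score'].round(3)
print(ranked.to_string(index=False))
```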

In [14]:
user_query = "Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters"
results = search_headlines(user_query)
print(f"Top 10 results for: '{user_query}'\n")
for i, res in enumerate(results, 1):
    print(f"{i}. [{res['category']}] {res['headline']}")
    print(f"   Similarity Score: {res['score']:.3f}\n")

Top 10 results for: 'Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters'

1. [POLITICS] 3.7 Million Americans Would Lose Food Stamps Under Trump Administration Rules: Study
   Similarity Score: 0.236

2. [TRAVEL] New Orleans: Where the Good Times Roll All Year Long
   Similarity Score: 0.215

3. [POLITICS] Wisconsin, 20 Other U.S. States Targeted By Russian Hackers In 2016 Election
   Similarity Score: 0.204

4. [POLITICS] Ohio Girl's Abortion Doctor Once Targeted In Vicious Kidnapping Threat: Report
   Similarity Score: 0.183

5. [POLITICS] Trump Could Roll Back Decades Of Progress That Made Immigrant Detention More Humane
   Similarity Score: 0.182

6. [POLITICS] Americans Are Now More Worried About Health Care Than Anything Else
   Similarity Score: 0.180

7. [TRAVEL] Way Too Many Americans Didn't Take Enough Vacation Days Last Year
   Similarity Score: 0.164

8. [POLITICS] Two Books Bill O'Reilly--and all Americans--Should Read
   Similarity Score: 0.160

