# 📰 News Headline Search Engine

Search for the most relevant news headlines across four categories using TF-IDF vectorization and cosine similarity.

---
## 🚦 Categories Searched
- **POLITICS**
- **TRAVEL**
- **SPORTS**
- **HOME & LIVING**

---

## 1️⃣ Data Preprocessing

In [1]:
import pandas as pd

In [2]:
# Load Kaggle News Category Dataset
df = pd.read_json('News_Category_Dataset_v3.json', lines=True)
df.head()

| | link | headline | category | short_description | authors | date |
|---|---|---|---|---|---|---|
| 0 | https://www.huffpost.com/entry/covid-boosters-... | Over 4 Million Americans Roll Up Sleeves For O... | U.S. NEWS | Health experts said it is too early to predict... | Carla K. Johnson, AP | 2022-09-23 |
| 1 | https://www.huffpost.com/entry/american-airlin... | American Airlines Flyer Charged, Banned For Li... | U.S. NEWS | He was subdued by passengers and crew when he ... | Mary Papenfuss | 2022-09-23 |
| 2 | https://www.huffpost.com/entry/funniest-tweets... | 23 Of The Funniest Tweets About Cats And Dogs ... | COMEDY | "Until you have a dog you don't understand wha... | Elyse Wanshel | 2022-09-23 |
| 3 | https://www.huffpost.com/entry/funniest-parent... | The Funniest Tweets From Parents This Week (Se... | PARENTING | "Accidentally put grown-up toothpaste on my to... | Caroline Bologna | 2022-09-23 |
| 4 | https://www.huffpost.com/entry/amy-cooper-lose... | Woman Who Called Cops On Black Bird-Watcher Lo... | U.S. NEWS | Amy Cooper accused investment firm Franklin Te... | Nina Golgowski | 2022-09-22 |


In [3]:
# Filter for the four categories & keep only headline and category
categories = ['POLITICS', 'TRAVEL', 'SPORTS', 'HOME & LIVING']
filtered = df[df['category'].isin(categories)][['headline', 'category']]

In [4]:
# Balance dataset: 1000 per category
balanced = filtered.groupby('category').apply(lambda x: x.sample(1000, random_state=42)).reset_index(drop=True)
print(balanced['category'].value_counts())
balanced.head()

category
HOME & LIVING    1000
POLITICS         1000
SPORTS           1000
TRAVEL           1000
Name: count, dtype: int64



| | headline | category |
|---|---|---|
| 0 | Busiest Shipping Day Of The Year Is Today, Ann... | HOME & LIVING |
| 1 | What To Watch On Netflix That’s New This Week ... | HOME & LIVING |
| 2 | Repurposing Idea Shows You How To Organize Hai... | HOME & LIVING |
| 3 | Company Buys $8000 Horse Lamp By Front Design ... | HOME & LIVING |
| 4 | Renovate for Rent | HOME & LIVING |


---
## 2️⃣ TF-IDF Vectorization

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Train TF-IDF vectorizer on 4000 headlines
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(balanced['headline'])

print(f'Headline TF-IDF matrix shape: {X.shape}')

Headline TF-IDF matrix shape: (4000, 8302)


---

## 3️⃣ Search Function Implementation

In [6]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search_headlines(query, top_k=10):
    # Transform query to TF-IDF
    query_vec = vectorizer.transform([query])
    # Compute cosine similarity
    sims = cosine_similarity(query_vec, X)[0]
    # Get top results
    top_idx = np.argsort(sims)[::-1][:top_k]
    results = []
    for idx in top_idx:
        results.append({
            'headline': balanced.iloc[idx]['headline'],
            'category': balanced.iloc[idx]['category'],
            'score': sims[idx]
        })
    return results
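
One behavior of the function above is worth noting: `np.argsort` always returns `top_k` rows, even when their similarity is zero. The sketch below shows one possible variant that drops zero-score rows. It runs on a hypothetical toy corpus, and `search_nonzero`, `toy_headlines`, `toy_vectorizer`, and `toy_X` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for balanced['headline']
toy_headlines = [
    "President addresses the nation",
    "Election results announced",
    "Top beaches for a summer trip",
    "Team wins the championship game",
]
toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_X = toy_vectorizer.fit_transform(toy_headlines)

def search_nonzero(query, top_k=10):
    """Rank toy headlines by cosine similarity, skipping zero-score rows."""
    query_vec = toy_vectorizer.transform([query])
    sims = cosine_similarity(query_vec, toy_X)[0]
    top_idx = np.argsort(sims)[::-1][:top_k]
    return [(toy_headlines[i], float(sims[i])) for i in top_idx if sims[i] > 0]

hits = search_nonzero("president election results")
for headline, score in hits:
    print(f"{score:.3f}  {headline}")

# A query with no in-vocabulary terms matches nothing
print(search_nonzero("quantum physics"))
```

Whether to filter zero scores is a design choice; the original function keeps them so that it always returns exactly `top_k` rows.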

---

## 4️⃣ Example User Search

In [7]:
user_query = "president election results"
results = search_headlines(user_query)

# Display Results
print(f"Top 10 results for: '{user_query}'\n")
for i, res in enumerate(results, 1):
    print(f"{i}. [{res['category']}] {res['headline']}")
    print(f"   Similarity Score: {res['score']:.3f}\n")

Top 10 results for: 'president election results'

1. [SPORTS] U.S. Open Results: Novak Djokovic Defeats Julien Benneteau In Third Round
   Similarity Score: 0.278

2. [POLITICS] We’re Still, Somehow, A Year Away From The Presidential Election
   Similarity Score: 0.266

3. [POLITICS] Protecting America From Its President
   Similarity Score: 0.235

4. [POLITICS] Lying To The Press Is Nothing New For The President
   Similarity Score: 0.215

5. [POLITICS] Obama Has Some Issues With How The Media Are Covering The Election
   Similarity Score: 0.209

6. [TRAVEL] President Obama Hawaii: What To Do On Oahu (PHOTOS)
   Similarity Score: 0.207

7. [HOME & LIVING] 8 Problems You May Encounter Going To Vote In The Election
   Similarity Score: 0.204

8. [POLITICS] Obama To Visit A Mosque For The First Time As President
   Similarity Score: 0.196

9. [POLITICS] This President's Tweeting Is Squandering Our Time
   Similarity Score: 0.190

10. [POLITICS] Barack Obama Sanctions Russia Over Election Meddling
   Similarity Score: 0.189

---

## 5️⃣ Pretty Table Display

In [8]:
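from IPython.display import display  # available by default in Jupyter; imported for use elsewhere

def display_results(query, results):
    # Build a ranked table from the list of result dicts
    # (named `table` to avoid shadowing the dataset DataFrame `df`)
    table = pd.DataFrame(results)
    table['Rank'] = range(1, len(table) + 1)
    table = table[['Rank', 'headline', 'category', 'score']]
    table = table.rename(columns={'headline': 'Headline', 'category': 'Category', 'score': 'Similarity Score'})
    print(f"\nSearch Results for: '{query}'")
    # Shade the score column; background_gradient requires matplotlib
    display(table.style.background_gradient(subset=['Similarity Score'], cmap='Blues'))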

# Run with pretty table:
display_results(user_query, results)


Search Results for: 'president election results'


| | Rank | Headline | Category | Similarity Score |
|---|---|---|---|---|
| 0 | 1 | U.S. Open Results: Novak Djokovic Defeats Julien Benneteau In Third Round | SPORTS | 0.277896 |
| 1 | 2 | We’re Still, Somehow, A Year Away From The Presidential Election | POLITICS | 0.26579 |
| 2 | 3 | Protecting America From Its President | POLITICS | 0.234932 |
| 3 | 4 | Lying To The Press Is Nothing New For The President | POLITICS | 0.214921 |
| 4 | 5 | Obama Has Some Issues With How The Media Are Covering The Election | POLITICS | 0.208704 |
| 5 | 6 | President Obama Hawaii: What To Do On Oahu (PHOTOS) | TRAVEL | 0.207233 |
| 6 | 7 | 8 Problems You May Encounter Going To Vote In The Election | HOME & LIVING | 0.203884 |
| 7 | 8 | Obama To Visit A Mosque For The First Time As President | POLITICS | 0.195724 |
| 8 | 9 | This President's Tweeting Is Squandering Our Time | POLITICS | 0.190157 |
| 9 | 10 | Barack Obama Sanctions Russia Over Election Meddling | POLITICS | 0.189131 |
