# News Search Engine

As a user of the news platform, I want to search for articles by entering a query,
so that I can quickly find the most relevant news headlines across multiple categories.

# 1. Data Preprocessing

In [9]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

Load the dataset (adjust the path to wherever your copy of the Kaggle dataset file is stored).

In [2]:
df = pd.read_json("Data/News_Category_Dataset_v3.json", lines=True)

In [3]:
df

| | link | headline | category | short_description | authors | date |
|---|---|---|---|---|---|---|
| 0 | https://www.huffpost.com/entry/covid-boosters-... | Over 4 Million Americans Roll Up Sleeves For O... | U.S. NEWS | Health experts said it is too early to predict... | Carla K. Johnson, AP | 2022-09-23 |
| 1 | https://www.huffpost.com/entry/american-airlin... | American Airlines Flyer Charged, Banned For Li... | U.S. NEWS | He was subdued by passengers and crew when he ... | Mary Papenfuss | 2022-09-23 |
| 2 | https://www.huffpost.com/entry/funniest-tweets... | 23 Of The Funniest Tweets About Cats And Dogs ... | COMEDY | "Until you have a dog you don't understand wha... | Elyse Wanshel | 2022-09-23 |
| 3 | https://www.huffpost.com/entry/funniest-parent... | The Funniest Tweets From Parents This Week (Se... | PARENTING | "Accidentally put grown-up toothpaste on my to... | Caroline Bologna | 2022-09-23 |
| 4 | https://www.huffpost.com/entry/amy-cooper-lose... | Woman Who Called Cops On Black Bird-Watcher Lo... | U.S. NEWS | Amy Cooper accused investment firm Franklin Te... | Nina Golgowski | 2022-09-22 |
| ... | ... | ... | ... | ... | ... | ... |
| 209522 | https://www.huffingtonpost.com/entry/rim-ceo-t... | RIM CEO Thorsten Heins' 'Significant' Plans Fo... | TECH | Verizon Wireless and AT&T are already promotin... | Reuters, Reuters | 2012-01-28 |
| 209523 | https://www.huffingtonpost.com/entry/maria-sha... | Maria Sharapova Stunned By Victoria Azarenka I... | SPORTS | Afterward, Azarenka, more effusive with the pr... | | 2012-01-28 |
| 209524 | https://www.huffingtonpost.com/entry/super-bow... | Giants Over Patriots, Jets Over Colts Among M... | SPORTS | Leading up to Super Bowl XLVI, the most talked... | | 2012-01-28 |
| 209525 | https://www.huffingtonpost.com/entry/aldon-smi... | Aldon Smith Arrested: 49ers Linebacker Busted ... | SPORTS | CORRECTION: An earlier version of this story i... | | 2012-01-28 |


Keep only the required categories.

In [4]:
categories = ["POLITICS", "TRAVEL", "SPORTS", "HOME & LIVING"]
df_filtered = df[df["category"].isin(categories)]

Balance the data: take the first 1,000 rows of each category.

In [5]:
df_balanced = df_filtered.groupby("category").head(1000).copy()
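The balancing step can be illustrated on a toy frame. This is only a sketch: the `toy` DataFrame and `n_per_category` are hypothetical stand-ins, not part of the notebook. It shows that `groupby(...).head(n)` keeps the first `n` rows of each group in their original order, and, as one possible alternative, that `groupby(...).sample(...)` draws a random `n` rows per group instead.

```python
import pandas as pd

# Hypothetical toy data standing in for df_filtered
toy = pd.DataFrame({
    "headline": [f"h{i}" for i in range(8)],
    "category": ["SPORTS", "POLITICS", "SPORTS", "SPORTS",
                 "POLITICS", "TRAVEL", "POLITICS", "TRAVEL"],
})

n_per_category = 2

# First n rows of each category; the original row order is preserved
first_n = toy.groupby("category").head(n_per_category)

# Alternative: a random n rows per category (fixed seed for repeatability).
# This raises an error if any category has fewer than n rows.
random_n = toy.groupby("category").sample(n=n_per_category, random_state=0)

print(first_n["category"].value_counts())
print(random_n["category"].value_counts())
```

Because the dataset listing above runs from 2022 down to 2012, taking the first rows per category appears to favor the most recent articles; random sampling would spread the selection across the whole date range.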

Keep only the `headline` and `category` columns.

In [6]:
df_balanced = df_balanced[["headline", "category"]].reset_index(drop=True)

In [7]:
print(df_balanced.shape)

(4000, 2)


In [8]:
print(df_balanced.head())

                                            headline  category
0  Maury Wills, Base-Stealing Shortstop For Dodge...    SPORTS
1  Biden Says U.S. Forces Would Defend Taiwan If ...  POLITICS
2  ‘Beautiful And Sad At The Same Time’: Ukrainia...  POLITICS
3  Las Vegas Aces Win First WNBA Title, Chelsea G...    SPORTS
4  Biden Says Queen's Death Left 'Giant Hole' For...  POLITICS


# 2. Vectorization (TF-IDF)

#### The search focuses on four steps:
###### 1. Transform the query into a TF-IDF vector
###### 2. Compute cosine similarity between the query and all headlines
###### 3. Get the top k indices
###### 4. Prepare the results

In [10]:
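def search(query, top_k=10):
    # `vectorizer` and `X` are looked up when search() is called, so the
    # TF-IDF vectorizer must be fitted (see section 4) before the first call.

    # Step 1: transform the query into a TF-IDF vector
    query_vec = vectorizer.transform([query])

    # Step 2: cosine similarity between the query and every headline
    similarities = cosine_similarity(query_vec, X).flatten()

    # Step 3: indices of the top_k highest scores, best first
    top_indices = similarities.argsort()[::-1][:top_k]

    # Step 4: collect (headline, category, score) tuples
    results = []
    for idx in top_indices:
        headline = df_balanced.iloc[idx]["headline"]
        category = df_balanced.iloc[idx]["category"]
        score = similarities[idx]
        results.append((headline, category, score))

    return results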
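To make the scoring idea concrete without any library, here is a minimal pure-Python sketch of TF-IDF weighting and cosine similarity. It is an illustration under simplifying assumptions, not scikit-learn's implementation: the helper names (`tokenize`, `build_idf`, `tfidf_vector`, `cosine`) and the three toy documents are hypothetical, tokenization is a plain whitespace split, and no stop words are removed. The IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1`.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; real tokenizers also handle punctuation.
    return text.lower().split()

def build_idf(docs):
    n = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    # Smoothed IDF: rarer terms get larger weights
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in doc_freq.items()}

def tfidf_vector(text, idf):
    # Terms that never appeared in the corpus are dropped
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {t: tf * idf[t] for t, tf in counts.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    "senate debate on voting bill",
    "team wins title game",
    "cheap travel tips",
]
idf = build_idf(docs)
doc_vecs = [tfidf_vector(d, idf) for d in docs]
query_vec = tfidf_vector("senate debate", idf)

scores = [cosine(query_vec, v) for v in doc_vecs]
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best], scores)
```

Only the first document shares terms with the query, so it is the only one with a non-zero score; the other two score exactly 0.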


# 3. Search Implementation

In [14]:
def search(query, top_k=10):
    # Transform query into TF-IDF vector
    query_vec = vectorizer.transform([query])
    
    # Compute cosine similarity with all headlines
    similarities = cosine_similarity(query_vec, X).flatten()
    
    # Get top k indices
    top_indices = similarities.argsort()[::-1][:top_k]
    
    # Prepare results
    results = []
    for idx in top_indices:
        headline = df_balanced.iloc[idx]["headline"]
        category = df_balanced.iloc[idx]["category"]
        score = similarities[idx]
        results.append((headline, category, score))
    
    return results
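One edge case worth keeping in mind: when a query shares no vocabulary with any headline, every similarity is 0, yet slicing the sorted indices still returns `top_k` headlines. The sketch below shows one possible way to drop such non-matches. The helper name `rank_nonzero` is hypothetical and not part of the notebook; it works on a plain score array so it can be tried in isolation.

```python
import numpy as np

def rank_nonzero(similarities, top_k=10):
    """Return (index, score) pairs for the top_k scores, skipping zero scores."""
    similarities = np.asarray(similarities, dtype=float)
    # Same ranking as in search(): sort ascending, reverse, keep the first top_k
    order = similarities.argsort()[::-1][:top_k]
    # Keep only entries that actually share at least one term with the query
    return [(int(i), float(similarities[i])) for i in order if similarities[i] > 0]

print(rank_nonzero([0.0, 0.3, 0.0, 0.1], top_k=3))
print(rank_nonzero([0.0, 0.0, 0.0], top_k=3))
```

With this filter, a query that matches nothing yields an empty list rather than ten arbitrary headlines with a score of 0.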


# 4. User Experience (Testing Search)

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words="english")

# Fit on the headlines
X = vectorizer.fit_transform(df_balanced["headline"])



In [19]:
# Example query
query = "government election debate"
results = search(query)

# Display results
print("\nTop 10 Results for:", query)
for rank, (headline, category, score) in enumerate(results, 1):
    print(f"{rank}. {headline}  |  {category}  |  Score: {score:.4f}")



Top 10 Results for: government election debate
1. Biden Slams Republicans For Blocking Debate On Voting Rights Bill  |  POLITICS  |  Score: 0.2870
2. Why Travel Fees Should Be Regulated By The Government  |  TRAVEL  |  Score: 0.2860
3. What To Know About The Growing Debate Over COVID-19 Vaccine Patents And Equity  |  POLITICS  |  Score: 0.2595
4. RNC Ripped, Ridiculed Over Presidential Debate Ban Threat  |  POLITICS  |  Score: 0.2524
5. Republicans Block Debate On Voting Rights Bill, Setting Up Summer Filibuster Fight  |  POLITICS  |  Score: 0.2459
6. Government Tries To Protect Air Travelers. Will Anyone Notice?  |  TRAVEL  |  Score: 0.2325
7. Senate Republicans Threaten Government Shutdown Over ‘Vaccine Mandate’  |  POLITICS  |  Score: 0.2259
8. And They're Back! Buyer Demand Rebounds After Government Reopens  |  HOME & LIVING  |  Score: 0.2229
9. Biden Signs Stopgap Spending Bill Averting Government Shutdown  |  POLITICS  |  Score: 0.2067
10. House Passes Government Funding As Sena