#### **As a user of the news platform, I want to search for articles by entering a query, so that I can quickly find the most relevant news headlines across multiple categories.**

### **1. Data Preprocessing**
1.1. Load the dataset and keep only the selected categories.

1.2. Balance the dataset (1000 headlines per category).

1.3. Keep only the headline and category columns.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# 1. Load & Prepare Dataset
df = pd.read_json("News_Category_Dataset_v3.json", lines=True)

In [3]:
# Keep only needed categories
categories = ["POLITICS", "TRAVEL", "SPORTS", "HOME & LIVING"]
df = df[df["category"].isin(categories)][["headline", "category"]]

In [4]:
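# Balance the dataset: sample 1000 headlines from each category.
# Iterating over the groups draws the same samples as groupby().apply(...)
# while avoiding its deprecation warning about operating on grouping columns.
df = pd.concat(
    [group.sample(1000, random_state=42) for _, group in df.groupby("category")],
    ignore_index=True,
)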


In [5]:
df = df[["headline", "category"]]
df

| | headline | category |
|---|---|---|
| 0 | Busiest Shipping Day Of The Year Is Today, Ann... | HOME & LIVING |
| 1 | What To Watch On Netflix That’s New This Week ... | HOME & LIVING |
| 2 | Repurposing Idea Shows You How To Organize Hai... | HOME & LIVING |
| 3 | Company Buys $8000 Horse Lamp By Front Design ... | HOME & LIVING |
| 4 | Renovate for Rent | HOME & LIVING |
| ... | ... | ... |
| 3995 | The 7 Most Mysterious Stone-Carved Faces That ... | TRAVEL |
| 3996 | Tips for a Stress-Free Family Summer Vacation | TRAVEL |
| 3997 | These Are The Busiest Flight Routes In The World | TRAVEL |
| 3998 | This Is The Best, Most Underrated Travel Resource | TRAVEL |
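A quick way to confirm that the balancing step worked is to count rows per category. The sketch below uses a small toy DataFrame in place of the real dataset; the names `toy`, `balanced`, and `n_per_category` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the real news dataset.
toy = pd.DataFrame({
    "headline": [f"headline {i}" for i in range(12)],
    "category": ["POLITICS"] * 5 + ["TRAVEL"] * 4 + ["SPORTS"] * 3,
})

n_per_category = 3  # the notebook uses 1000

# Draw a fixed-size sample from each category, then stack the samples.
balanced = pd.concat(
    [group.sample(n_per_category, random_state=42) for _, group in toy.groupby("category")],
    ignore_index=True,
)

# Every category should contribute exactly n_per_category rows.
counts = balanced["category"].value_counts()
print(counts)
assert (counts == n_per_category).all()
```

On the real data, the same `value_counts()` call on `df["category"]` should report 1000 for each of the four categories, provided each category has at least 1000 headlines to sample from.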


### **2. Vectorization**
2.1. Fit a TF-IDF vectorizer on the 4000 headlines.

2.2. Store the resulting vectors for all articles.

In [9]:
# 2. TF-IDF Vectorization
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(df["headline"])

### **3. Search Implementation**
3.1. Accept user queries.

3.2. Transform query into TF-IDF vector.

3.3. Compute cosine similarity with all article vectors.

3.4. Return top 10 results.

In [10]:
# 3. Search Function
def search_news(query, top_n=10):
    # Transform query to TF-IDF
    query_vec = vectorizer.transform([query])
    # Compute cosine similarity
    similarities = cosine_similarity(query_vec, X).flatten()
    # Get top N indices
    top_idx = similarities.argsort()[::-1][:top_n]
    # Collect results
    results = []
    for idx in top_idx:
        results.append({
            "headline": df.iloc[idx]["headline"],
            "category": df.iloc[idx]["category"],
            "score": round(similarities[idx], 3)
        })
    return pd.DataFrame(results)
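One edge case worth noting: if none of the query's terms appear in the fitted vocabulary, the query vector is all zeros, every similarity is 0, and the function still returns `top_n` rows in an essentially arbitrary order. A possible variant that drops zero-score rows is sketched below on a toy corpus. The function name `search_nonzero` and the `toy_*` objects are assumptions made for this example, not part of the notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the 4000 headlines.
toy_df = pd.DataFrame({
    "headline": [
        "President launches election campaign",
        "Election results surprise the president",
        "Best beaches to visit this summer",
    ],
    "category": ["POLITICS", "POLITICS", "TRAVEL"],
})
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_X = toy_vectorizer.fit_transform(toy_df["headline"])


def search_nonzero(query, top_n=10):
    """Hypothetical variant of search_news that omits zero-similarity rows."""
    query_vec = toy_vectorizer.transform([query])
    similarities = cosine_similarity(query_vec, toy_X).flatten()
    # Indices of the top_n highest scores, best first.
    top_idx = similarities.argsort()[::-1][:top_n]
    # Keep only headlines that share at least one term with the query.
    top_idx = top_idx[similarities[top_idx] > 0]
    return pd.DataFrame({
        "headline": toy_df["headline"].iloc[top_idx].to_list(),
        "category": toy_df["category"].iloc[top_idx].to_list(),
        "score": similarities[top_idx].round(3),
    })


print(search_nonzero("election campaign"))  # two matching headlines
print(search_nonzero("zzzz"))               # empty DataFrame: no shared terms
```

Whether to return an empty result or fall back to some default listing is a design choice; an empty result at least makes it clear to the user that nothing matched.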

### **4. User Experience**

4.1. Results should include:

4.1.1. Headline text

4.1.2. Category label

4.1.3. Similarity score

4.2. Results should be clearly ranked.

In [11]:
query = "election campaign president"
results = search_news(query)
results

| | headline | category | score |
|---|---|---|---|
| 0 | We’re Still, Somehow, A Year Away From The Pre... | POLITICS | 0.306 |
| 1 | Protecting America From Its President | POLITICS | 0.27 |
| 2 | Lying To The Press Is Nothing New For The Pres... | POLITICS | 0.247 |
| 3 | Obama Has Some Issues With How The Media Are C... | POLITICS | 0.24 |
| 4 | Hillary Clinton Is On Her Way To A $1 Billion ... | POLITICS | 0.239 |
| 5 | President Obama Hawaii: What To Do On Oahu (PH... | TRAVEL | 0.238 |
| 6 | 8 Problems You May Encounter Going To Vote In ... | HOME & LIVING | 0.235 |
| 7 | This Is What It's Like To Spend A Week On A Pr... | POLITICS | 0.23 |
| 8 | Bernie Sanders’ Campaign Is In Big Trouble Wit... | POLITICS | 0.226 |
| 9 | Obama To Visit A Mosque For The First Time As ... | POLITICS | 0.225 |
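The results are ordered by score, but the ranking is only implied by the zero-based index. One way to make it explicit, in line with requirement 4.2, is to add a 1-based `rank` column. The sketch below uses a placeholder DataFrame with made-up headlines and scores in place of the real `search_news` output; the `rank` column and the `ranked` name are assumptions for illustration.

```python
import pandas as pd

# Placeholder for the DataFrame returned by search_news (already sorted by score).
# Headlines and scores here are invented for the example.
results = pd.DataFrame({
    "headline": ["Headline A", "Headline B", "Headline C"],
    "category": ["POLITICS", "POLITICS", "TRAVEL"],
    "score": [0.9, 0.5, 0.1],
})

# Add an explicit 1-based rank as the first column.
ranked = results.copy()
ranked.insert(0, "rank", list(range(1, len(ranked) + 1)))

# Print without the zero-based index so the rank column is the only numbering shown.
print(ranked.to_string(index=False))
```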
