<h1>ASSIGNMENT 3</h1>

<h2> Q1. Problem Statement</h2>

Design an information retrieval system that ranks a collection of headlines by their relevance to a user's search query. Use raw term frequency to represent the headlines (documents) and compute a similarity score for ranking.

After implementation, test your system with at least three different queries and display the ranked results clearly.
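
To make the raw term frequency scoring concrete, here is a minimal sketch on a toy collection. The three documents, the query, and the helper names `tf` and `score` are hypothetical and are not taken from the dataset. It assumes that "similarity" means the unnormalized dot product of the term-count vectors.

```python
from collections import Counter

# Hypothetical toy documents, not taken from the headlines dataset.
docs = [
    "storm hits coast",
    "storm after storm batters coast town",
    "council approves budget",
]
query = "storm coast"


def tf(text):
    """Raw term frequencies: how often each token occurs in the text."""
    return Counter(text.lower().split())


def score(query_tf, doc_tf):
    """Dot product of two raw term-frequency vectors."""
    # A Counter returns 0 for missing terms, so unmatched words add nothing.
    return sum(count * doc_tf[word] for word, count in query_tf.items())


query_tf = tf(query)
scores = [score(query_tf, tf(d)) for d in docs]

# Highest score first; the sort is stable, so ties keep their input order.
ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
for doc, s in ranked:
    print(f"Score: {s} | {doc}")
```

The second document ranks first because it repeats "storm". With raw counts, repetition alone raises the score.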


Dataset:
[Headlines Dataset](https://www.kaggle.com/datasets/therohk/million-headlines)



In [None]:
import pandas as pd
import re
from collections import Counter

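# Load the dataset and keep the headlines as a plain list of strings.
df = pd.read_csv('abcnews-date-text.csv')
headlines = df['headline_text'].astype(str).tolist()

def preprocessdata(text):
    """Lowercase the text, strip punctuation, and split on whitespace."""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text.split()

# One raw term-frequency vector (token -> count) per headline.
tf_vectors = [Counter(preprocessdata(h)) for h in headlines]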

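def rank(query, tf_vectors, headlines, top_n=3):
    """Return the top_n (headline, score) pairs that have a non-zero score."""
    query_tf = Counter(preprocessdata(query))
    # Score = dot product of the query and headline raw term-frequency vectors.
    scores = [
        sum(count * tf[word] for word, count in query_tf.items())
        for tf in tf_vectors
    ]
    # Drop non-matching headlines first so that only the matches are sorted.
    matches = [(h, s) for h, s in zip(headlines, scores) if s > 0]
    # The sort is stable, so tied headlines keep their dataset order.
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return matches[:top_n]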

queries = [
    "air nz strike",
    "natural disaster",
    "sports championship"
]

for q in queries:
    print(f"\nTop results for query: '{q}'")
    results = rank(q, tf_vectors, headlines)
    for headline, score in results:
        print(f"Score: {score} | {headline}")


Top results for query: 'air nz strike'
Score: 3 | air nz staff in aust strike for pay rise
Score: 3 | air nz strike to affect australian travellers
Score: 3 | air nz staff to strike

Top results for query: 'natural disaster'
Score: 2 | natural disaster areas declared
Score: 2 | natural disaster areas declared after downpour
Score: 2 | amery wont declare drought a natural disaster

Top results for query: 'sports championship'
Score: 2 | two sports enthusiasts set out to see 365 sports in a year
Score: 2 | nsw rugby 7s championship to boost sports popularity
Score: 2 | fox sports foi request 30 million sports deal
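
In the 'sports championship' results, the first headline scores 2 only because it repeats "sports". It does not contain "championship" at all. One possible refinement, which is not part of the solution above, is to divide the dot product by the vector lengths (cosine similarity). The sketch below applies that to two of the headlines printed above; the helper names `tokens` and `cosine` are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase, strip punctuation, split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def dot(a, b):
    """Unnormalized dot product of two raw term-frequency vectors."""
    return sum(count * b[word] for word, count in a.items())


def cosine(a, b):
    """Dot product divided by the Euclidean lengths of both vectors."""
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot(a, b) / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = Counter(tokens("sports championship"))
# Two headlines copied from the printed results above.
repeated = Counter(tokens("two sports enthusiasts set out to see 365 sports in a year"))
both_terms = Counter(tokens("nsw rugby 7s championship to boost sports popularity"))

for name, vec in [("repeated", repeated), ("both_terms", both_terms)]:
    print(f"{name}: dot={dot(query, vec)} cosine={cosine(query, vec):.3f}")
```

Both headlines tie on the raw dot product, but under cosine similarity the headline that contains both query terms scores higher.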


<h2> Q2. Problem Statement</h2>

Enhance your retrieval system by using TF-IDF weighting instead of raw term frequency. Update the term-document matrix and the query representation accordingly, then compute similarity scores and rank the headlines.

After implementation, test your system with at least three different queries and display the ranked results clearly.
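
As a rough illustration of what TF-IDF weighting changes, here is a hand-rolled sketch on a toy collection. The documents and the helper names are hypothetical. The smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and the L2 normalization are assumptions chosen to resemble scikit-learn's default behaviour. Stop-word removal is left out to keep the sketch short.

```python
import math
from collections import Counter

# Hypothetical toy documents, not taken from the headlines dataset.
docs = [
    "election results announced",
    "election campaign begins",
    "council approves budget",
]

doc_tokens = [d.lower().split() for d in docs]
n_docs = len(docs)
# Document frequency: in how many documents each term appears.
df = Counter(term for toks in doc_tokens for term in set(toks))


def idf(term):
    """Smoothed inverse document frequency (assumed formula, see above)."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1.0


def tfidf_vector(toks):
    """L2-normalized TF-IDF weights; terms unseen in the collection are dropped."""
    counts = Counter(t for t in toks if t in df)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(a, b):
    """For unit-length vectors, the dot product equals the cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


doc_vecs = [tfidf_vector(toks) for toks in doc_tokens]
query_vec = tfidf_vector("election results".split())
scores = [cosine(query_vec, v) for v in doc_vecs]

for doc, s in sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True):
    print(f"Score: {s:.4f} | {doc}")
```

Because "results" occurs in fewer documents than "election", it carries more weight, so the document containing both terms ranks well ahead of the one that matches only "election".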


In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv('abcnews-date-text.csv')
headlines = df['headline_text'].astype(str).tolist()

vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(headlines)

queries = [
    "election results",
    "man",
    "rudd"
]

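for q in queries:
    # Project the query into the same TF-IDF space as the headlines.
    query_vec = vectorizer.transform([q])
    # TfidfVectorizer L2-normalizes rows by default, so this dot product
    # is the cosine similarity between the query and each headline.
    scores = (tfidf_matrix @ query_vec.T).toarray().flatten()
    # Indices of the three highest-scoring headlines, best first.
    top_indices = scores.argsort()[::-1][:3]
    print(f"\nTop results for query: '{q}'")
    for idx in top_indices:
        # Skip headlines that share no terms with the query.
        if scores[idx] > 0:
            print(f"Score: {scores[idx]:.4f} | {headlines[idx]}")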


Top results for query: 'election results'
Score: 0.8763 | results of the nsw by election
Score: 0.7615 | nauru election results
Score: 0.7613 | thailand election results

Top results for query: 'man'
Score: 1.0000 | man found
Score: 0.7186 | man charged after man hunt
Score: 0.6852 | man charged over the death of melbourne man

Top results for query: 'rudd'
Score: 1.0000 | rudd at un
Score: 0.7998 | rudd back in australia
Score: 0.7581 | rudd an attack on us all
