
# **InformationRetreival_EX03_211718366.ipynb**
# **Rocchio Algorithm - Query Optimization**



---

### **Assignment Requirements:**

**A. Input Preparation:**
- 50 documents represented as vectors using the Bag of Words method with TF-IDF scores
- Words: ***team, coach, hockey, baseball, soccer, penalty, score, win, loss, season***
- If the word does not appear: score 0
- If the word appears: random score between 2 and 6
- 20% relevant documents, 80% non-relevant
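
The input rules above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: the helper names `make_vector` and `make_labels`, and the 50% chance that a term is present, are assumptions introduced here.

```python
import random

TERMS = ["team", "coach", "hockey", "baseball", "soccer",
         "penalty", "score", "win", "loss", "season"]

def make_vector(rng, p_present=0.5):
    """Hypothetical helper: 0.0 if a term is absent, else a score in [2, 6]."""
    return [round(rng.uniform(2.0, 6.0), 2) if rng.random() < p_present else 0.0
            for _ in TERMS]

def make_labels(rng, num_docs, relevant_fraction=0.20):
    """Hypothetical helper: 1 = relevant, 0 = non-relevant, in shuffled order."""
    num_relevant = int(num_docs * relevant_fraction)
    labels = [1] * num_relevant + [0] * (num_docs - num_relevant)
    rng.shuffle(labels)
    return labels

rng = random.Random(42)  # a local generator keeps the sketch reproducible
docs = [make_vector(rng) for _ in range(50)]
labels = make_labels(rng, len(docs))
print(sum(labels), "relevant of", len(labels))
```

The notebook itself decides term presence from real Wikipedia text rather than from a coin flip; only the scoring and labeling rules are the same.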

**B. Finding q_opt:**
- q_opt = μ_R + μ_R - μ_NR = 2*μ_R - μ_NR
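
Written out with explicit centroids, the formula above reads as follows. The symbols $R$ and $NR$ (the sets of relevant and non-relevant document vectors $d$) are notation introduced here for illustration and do not come from the assignment text.

```latex
\mu_R = \frac{1}{|R|} \sum_{d \in R} d, \qquad
\mu_{NR} = \frac{1}{|NR|} \sum_{d \in NR} d, \qquad
q_{opt} = \mu_R + (\mu_R - \mu_{NR}) = 2\mu_R - \mu_{NR}
```

Read this way, q_opt starts at the centroid of the relevant documents and moves further along the direction that separates it from the non-relevant centroid.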

**C. Displaying Results:**
1. The five most significant features in q_opt
2. The 3 closest documents to q_opt with their vectors and labels

---

**Submitted by:** Yotam Katz  
**Date:** November 2025  
**Course:** Information Retrieval, 26 3700 א01  
**Lecturer:** Dr. Moshe Friedman  
**ID:** 211718366  
**Email:** Yotamkatz2000@gmail.com

---
## **Step 0: Install Libraries and Setup**

In [1]:
# Install Wikipedia library
!pip install wikipedia -q

  Preparing metadata (setup.py) ... done
  Building wheel for wikipedia (setup.py) ... done


In [2]:
# Import all required libraries
import wikipedia
import re
from collections import Counter
import numpy as np
import pandas as pd
import random
from google.colab import drive
import os
from IPython.display import display, HTML
import warnings
warnings.filterwarnings('ignore')

print("✓ All libraries imported successfully!")

✓ All libraries imported successfully!


In [3]:
# Configuration
SPORTS_TERMS = ['team', 'coach', 'hockey', 'baseball', 'soccer',
                'penalty', 'score', 'win', 'loss', 'season']

NUM_DOCUMENTS = 50
RELEVANT_PERCENTAGE = 0.20  # 20% relevant, 80% non-relevant

# Set random seed for reproducibility
random.seed(42)
np.random.seed(42)

print("Configuration:")
print(f"  Terms: {', '.join(SPORTS_TERMS)}")
print(f"  Number of documents: {NUM_DOCUMENTS}")
print(f"  Relevant percentage: {RELEVANT_PERCENTAGE*100}%")

Configuration:
  Terms: team, coach, hockey, baseball, soccer, penalty, score, win, loss, season
  Number of documents: 50
  Relevant percentage: 20.0%
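
Seeding matters here because the simulated TF-IDF scores and the label shuffle both draw from Python's `random` module. A minimal sketch of the effect, using only the standard library (the helper `draw_scores` is hypothetical):

```python
import random

def draw_scores(seed, n=3):
    """Hypothetical helper: draw n simulated scores after seeding."""
    random.seed(seed)
    return [round(random.uniform(2.0, 6.0), 2) for _ in range(n)]

first = draw_scores(42)
second = draw_scores(42)
print(first == second)  # same seed, same sequence
```

The seed most likely does not make the article-fetching step repeatable, since search results and random pages come from a live service.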


In [4]:
# Mount Google Drive
drive.mount('/content/drive')

# Create output directory
output_path = '/content/drive/MyDrive/Colab Notebooks/Information Retreival/Rocchio'
os.makedirs(output_path, exist_ok=True)

print(f"\n✓ Google Drive mounted")
print(f"✓ Output directory: {output_path}")

Mounted at /content/drive

✓ Google Drive mounted
✓ Output directory: /content/drive/MyDrive/Colab Notebooks/Information Retreival/Rocchio


---
## **Part A: Fetch Wikipedia Articles and Create TF-IDF Vectors**


In [5]:
def fetch_sports_articles(num_articles=50):
    """
    Fetch Wikipedia articles related to sports
    """
    articles = []
    print(f"Fetching {num_articles} sports-related Wikipedia articles...")
    print("=" * 70)

    # List of sports-related topics to search
    sports_topics = [
        "Football", "Basketball", "Baseball", "Hockey", "Soccer",
        "Tennis", "Golf", "Cricket", "Rugby", "Volleyball",
        "Olympics", "World Cup", "NBA", "NFL", "MLB", "NHL",
        "Premier League", "Champions League", "Super Bowl",
        "Wimbledon", "Athletics", "Swimming", "Boxing",
        "UFC", "Wrestling", "Marathon", "FIFA",
        "Coach", "Stadium", "Championship", "Tournament",
        "Sports team", "Game", "Match", "Season",
        "Michael Jordan", "Lionel Messi", "Tom Brady",
        "Cristiano Ronaldo", "LeBron James"
    ]

    attempts = 0
    max_attempts = num_articles * 3
    used_titles = set()

    while len(articles) < num_articles and attempts < max_attempts:
        attempts += 1

        try:
            # 70% chance to search for sports topic, 30% random
            if random.random() < 0.7 and sports_topics:
                topic = random.choice(sports_topics)
                search_results = wikipedia.search(topic, results=5)
                if search_results:
                    page_title = random.choice(search_results)
                else:
                    continue
            else:
                page_title = wikipedia.random(1)

            # Skip if already fetched
            if page_title in used_titles:
                continue

            # Fetch the page
            page = wikipedia.page(page_title, auto_suggest=False)

            # Check if it contains sports terms and has substantial content
            content_lower = page.content.lower()
            has_sports_term = any(term in content_lower for term in SPORTS_TERMS)

            if len(page.content) > 500 and has_sports_term:
                articles.append({
                    'title': page.title,
                    'content': page.content
                })
                used_titles.add(page_title)
                print(f"✓ {len(articles)}/{num_articles}: {page.title}")

        except (wikipedia.exceptions.DisambiguationError,
                wikipedia.exceptions.PageError):
            # Ambiguous or missing page: try another title
            continue
        except Exception:
            # Any other failure (e.g. a network error): skip this attempt
            continue

    print("=" * 70)
    print(f"Successfully fetched {len(articles)} articles\n")
    return articles
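
Note that the filter in `fetch_sports_articles` uses substring matching (`term in content_lower`), so a term such as `win` also matches inside longer words like `window`. This may explain why some loosely related pages pass the filter. A hedged sketch of a whole-word alternative follows; `has_whole_word_term` is a hypothetical helper, not part of the notebook.

```python
import re

SPORTS_TERMS = ['team', 'coach', 'hockey', 'baseball', 'soccer',
                'penalty', 'score', 'win', 'loss', 'season']

def has_substring_term(text, terms=SPORTS_TERMS):
    """Mirrors the notebook's check: any term anywhere in the text."""
    lowered = text.lower()
    return any(term in lowered for term in terms)

def has_whole_word_term(text, terms=SPORTS_TERMS):
    """Hypothetical alternative: match terms only as whole words."""
    pattern = r'\b(?:' + '|'.join(map(re.escape, terms)) + r')\b'
    return re.search(pattern, text.lower()) is not None

sample = "The cathedral has a large stained-glass window."
print(has_substring_term(sample))   # True: 'win' occurs inside 'window'
print(has_whole_word_term(sample))  # False: no term appears as a whole word
```

Switching the filter would change which articles are collected, so the outputs recorded in this notebook correspond to the substring version.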

In [6]:
# Fetch articles
print("\n" + "=" * 70)
print("STEP 1: FETCHING WIKIPEDIA ARTICLES")
print("=" * 70 + "\n")

articles = fetch_sports_articles(num_articles=NUM_DOCUMENTS)

print(f"\n✓ Successfully fetched {len(articles)} articles")
print("\nSample articles:")
for i, article in enumerate(articles[:5], 1):
    print(f"  {i}. {article['title']}")


STEP 1: FETCHING WIKIPEDIA ARTICLES

Fetching 50 sports-related Wikipedia articles...
✓ 1/50: National Basketball Association
✓ 2/50: Season
✓ 3/50: Baseball
✓ 4/50: MLB.com
✓ 5/50: Michael B. Jordan
✓ 6/50: Olympiastadion (Berlin)
✓ 7/50: American football
✓ 8/50: Swimming (sport)
✓ 9/50: Swimming
✓ 10/50: Volkswagen Golf
✓ 11/50: List of Major League Baseball team rosters
✓ 12/50: Wrestling
✓ 13/50: List of career achievements by LeBron James
✓ 14/50: Association football
✓ 15/50: Super Bowl
✓ 16/50: FIFA U-17 World Cup
✓ 17/50: Weight class (boxing)
✓ 18/50: Pro Evolution Soccer
✓ 19/50: MTV Splitsvilla
✓ 20/50: Professional wrestling
✓ 21/50: Ice hockey
✓ 22/50: Major League Soccer
✓ 23/50: Sport of athletics
✓ 24/50: Rugby union
✓ 25/50: National Cathedral of Romania
✓ 26/50: The Roast of Tom Brady
✓ 27/50: MLB Network
✓ 28/50: Tennis
✓ 29/50: Winter Olympic Games
✓ 30/50: Wrestling Isn't Wrestling
✓ 31/50: WWE Championship
✓ 32/50: Michael Jordan
✓ 33/50: International Cricket C

In [7]:
def clean_text(text):
    """Clean and tokenize text"""
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = text.lower()
    return text.split()

def calculate_tf_idf_for_document(article, term):
    """
    Simulate a TF-IDF score for a term in a document.
    Per the assignment:
    - If the term does NOT appear: score = 0
    - If the term DOES appear: random score between 2.0 and 6.0
    """
    words = clean_text(article['content'])

    if term not in words:
        return 0.0

    # Random TF-IDF score between 2.0 and 6.0
    tfidf_score = random.uniform(2.0, 6.0)
    return round(tfidf_score, 2)

def create_document_vectors(articles, terms):
    """
    Create TF-IDF vectors for all documents
    Returns: numpy array of shape (num_docs, num_terms)
    """
    print("Creating TF-IDF document vectors...")
    print(f"Number of documents: {len(articles)}")
    print(f"Number of terms: {len(terms)}")
    print(f"Terms: {', '.join(terms)}\n")

    vectors = []

    for i, article in enumerate(articles):
        vector = [calculate_tf_idf_for_document(article, term) for term in terms]
        vectors.append(vector)

        # Show first 3 vectors as examples
        if i < 3:
            print(f"Doc {i+1} ({article['title'][:40]}...)")
            print(f"  Vector: {vector}\n")

    return np.array(vectors)
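
One detail of `clean_text` is worth noting: because non-letters are deleted rather than replaced by spaces, hyphenated words are fused into a single token. The sketch below contrasts the notebook's behavior with a hypothetical variant, `clean_text_spaced`, that substitutes a space instead.

```python
import re

def clean_text(text):
    """Same logic as the notebook: delete non-letters, lowercase, split."""
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text.lower().split()

def clean_text_spaced(text):
    """Hypothetical variant: replace non-letters with a space before splitting."""
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    return text.lower().split()

sample = "Win-loss record: 10-2!"
print(clean_text(sample))         # ['winloss', 'record']
print(clean_text_spaced(sample))  # ['win', 'loss', 'record']
```

With the original tokenizer, neither `win` nor `loss` would be detected in this sample, so both terms would receive a score of 0.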

In [8]:
# Create TF-IDF vectors
print("\n" + "=" * 70)
print("STEP 2: CREATING TF-IDF VECTORS")
print("=" * 70 + "\n")

vectors = create_document_vectors(articles, SPORTS_TERMS)

print(f"✓ Created {len(vectors)} document vectors")
print(f"✓ Vector shape: {vectors.shape}")
print(f"   ({vectors.shape[0]} documents × {vectors.shape[1]} terms)")


STEP 2: CREATING TF-IDF VECTORS

Creating TF-IDF document vectors...
Number of documents: 50
Number of terms: 10
Terms: team, coach, hockey, baseball, soccer, penalty, score, win, loss, season

Doc 1 (National Basketball Association...)
  Vector: [2.26, 2.08, 4.22, 4.35, 0.0, 0.0, 2.03, 4.83, 0.0, 2.24]

Doc 2 (Season...)
  Vector: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.27]

Doc 3 (Baseball...)
  Vector: [2.13, 3.32, 0.0, 4.06, 3.11, 3.94, 4.16, 4.89, 5.53, 4.3]

✓ Created 50 document vectors
✓ Vector shape: (50, 10)
   (50 documents × 10 terms)


In [9]:
def assign_relevance_labels(num_docs, relevant_percentage):
    """
    Randomly assign relevance labels
    20% relevant (label=1), 80% non-relevant (label=0)
    """
    num_relevant = int(num_docs * relevant_percentage)

    # Create labels: 1 = relevant, 0 = non-relevant
    labels = [1] * num_relevant + [0] * (num_docs - num_relevant)

    # Shuffle randomly
    random.shuffle(labels)

    return np.array(labels)

In [10]:
# Assign labels
labels = assign_relevance_labels(len(articles), RELEVANT_PERCENTAGE)

num_relevant = np.sum(labels)
num_non_relevant = len(articles) - num_relevant

print("\nRelevance Labels Assigned:")
print(f"  Relevant documents: {num_relevant} ({num_relevant/len(articles)*100:.1f}%)")
print(f"  Non-relevant documents: {num_non_relevant} ({num_non_relevant/len(articles)*100:.1f}%)")

print("\nSample labels:")
for i in range(min(5, len(articles))):
    label_text = "RELEVANT" if labels[i] == 1 else "NON-RELEVANT"
    print(f"  Doc {i+1}: {label_text} - {articles[i]['title'][:50]}")


Relevance Labels Assigned:
  Relevant documents: 10 (20.0%)
  Non-relevant documents: 40 (80.0%)

Sample labels:
  Doc 1: NON-RELEVANT - National Basketball Association
  Doc 2: NON-RELEVANT - Season
  Doc 3: RELEVANT - Baseball
  Doc 4: NON-RELEVANT - MLB.com
  Doc 5: RELEVANT - Michael B. Jordan


---
## **Part B: Calculate q_opt using Rocchio Algorithm**


In [11]:
def calculate_q_opt(vectors, labels):
    """
    Calculate optimal query using Rocchio algorithm

    Formula: q_opt = μ_R + μ_R - μ_NR
                   = 2 * μ_R - μ_NR

    where:
    - μ_R = mean of relevant documents
    - μ_NR = mean of non-relevant documents
    """
    print("\n" + "=" * 70)
    print("CALCULATING q_opt USING ROCCHIO ALGORITHM")
    print("=" * 70)

    # Separate relevant and non-relevant documents
    relevant_docs = vectors[labels == 1]
    non_relevant_docs = vectors[labels == 0]

    print(f"\nNumber of relevant documents: {len(relevant_docs)}")
    print(f"Number of non-relevant documents: {len(non_relevant_docs)}")

    # Calculate means
    mu_R = np.mean(relevant_docs, axis=0)
    mu_NR = np.mean(non_relevant_docs, axis=0)

    print(f"\nμ_R (mean of relevant docs):")
    print(f"  {mu_R}")
    print(f"\nμ_NR (mean of non-relevant docs):")
    print(f"  {mu_NR}")

    # Calculate q_opt = 2*μ_R - μ_NR
    q_opt = 2 * mu_R - mu_NR

    print(f"\nFormula: q_opt = 2*μ_R - μ_NR")
    print(f"\nq_opt (optimal query vector):")
    print(f"  {q_opt}")

    return q_opt, mu_R, mu_NR

In [12]:
# Calculate q_opt
q_opt, mu_R, mu_NR = calculate_q_opt(vectors, labels)


CALCULATING q_opt USING ROCCHIO ALGORITHM

Number of relevant documents: 10
Number of non-relevant documents: 40

μ_R (mean of relevant docs):
  [1.87  0.577 1.148 0.882 1.325 0.831 2.226 2.227 1.164 2.747]

μ_NR (mean of non-relevant docs):
  [2.664   1.24725 0.7635  1.3055  0.57525 0.8415  1.733   1.97225 1.08075
 2.00425]

Formula: q_opt = 2*μ_R - μ_NR

q_opt (optimal query vector):
  [ 1.076   -0.09325  1.5325   0.4585   2.07475  0.8205   2.719    2.48175
  1.24725  3.48975]
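
As a sanity check, the reported q_opt can be recomputed from the two mean vectors printed above. The sketch below copies those printed values by hand and uses plain Python lists instead of NumPy; the variable name `q_opt_reported` is introduced here for illustration.

```python
mu_R = [1.87, 0.577, 1.148, 0.882, 1.325, 0.831, 2.226, 2.227, 1.164, 2.747]
mu_NR = [2.664, 1.24725, 0.7635, 1.3055, 0.57525, 0.8415, 1.733, 1.97225,
         1.08075, 2.00425]
q_opt_reported = [1.076, -0.09325, 1.5325, 0.4585, 2.07475, 0.8205, 2.719,
                  2.48175, 1.24725, 3.48975]

# q_opt = 2 * mu_R - mu_NR, applied component by component
q_opt = [2 * r - nr for r, nr in zip(mu_R, mu_NR)]

max_error = max(abs(a - b) for a, b in zip(q_opt, q_opt_reported))
print(max_error < 1e-9)
```

The negative `coach` component (-0.09325) arises because that term's mean score in the non-relevant documents (1.24725) is more than twice its mean in the relevant ones (0.577).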


---
## **Part C: Display Results**


### **C.1: Top 5 Most Significant Features**


In [13]:
def display_top_features(q_opt, terms, top_n=5):
    """
    Display the top N most significant features in q_opt
    """
    print("\n" + "=" * 70)
    print(f"TOP {top_n} MOST SIGNIFICANT FEATURES IN q_opt")
    print("=" * 70)

    # Create list of (term, score) pairs
    feature_scores = list(zip(terms, q_opt))

    # Sort by score (descending)
    feature_scores.sort(key=lambda x: x[1], reverse=True)

    # Display top N
    print(f"\n{'Rank':<6} {'Term':<15} {'Score':<10}")
    print("-" * 35)

    for i, (term, score) in enumerate(feature_scores[:top_n], 1):
        print(f"{i:<6} {term:<15} {score:<10.4f}")

    # Create DataFrame
    df = pd.DataFrame(feature_scores[:top_n], columns=['Term', 'Score'])
    df.insert(0, 'Rank', range(1, len(df) + 1))

    return feature_scores[:top_n], df

In [14]:
# Display top 5 features
top_features, df_features = display_top_features(q_opt, SPORTS_TERMS, top_n=5)

print("\n" + "=" * 70)
print("TOP 5 FEATURES - DATAFRAME VIEW")
print("=" * 70)
display(df_features)


TOP 5 MOST SIGNIFICANT FEATURES IN q_opt

Rank   Term            Score     
-----------------------------------
1      season          3.4897    
2      score           2.7190    
3      win             2.4817    
4      soccer          2.0747    
5      hockey          1.5325    

TOP 5 FEATURES - DATAFRAME VIEW


| | Rank | Term | Score |
|---|---|---|---|
| 0 | 1 | season | 3.48975 |
| 1 | 2 | score | 2.719 |
| 2 | 3 | win | 2.48175 |
| 3 | 4 | soccer | 2.07475 |
| 4 | 5 | hockey | 1.5325 |
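
The ranking above sorts by signed score. Since q_opt can contain negative components (here `coach` is about -0.09), one could also rank by absolute value so that strongly negative weights count as significant. That is an alternative reading of "most significant", not the notebook's method. The sketch below, with a hypothetical helper `top_terms`, suggests that for the q_opt reported earlier both readings give the same top five.

```python
terms = ['team', 'coach', 'hockey', 'baseball', 'soccer',
         'penalty', 'score', 'win', 'loss', 'season']
q_opt = [1.076, -0.09325, 1.5325, 0.4585, 2.07475, 0.8205, 2.719,
         2.48175, 1.24725, 3.48975]

def top_terms(scores, names, n=5, key=lambda s: s):
    """Return the n term names with the largest key(score)."""
    ranked = sorted(zip(names, scores), key=lambda pair: key(pair[1]),
                    reverse=True)
    return [name for name, _ in ranked[:n]]

by_value = top_terms(q_opt, terms)
by_magnitude = top_terms(q_opt, terms, key=abs)
print(by_value)                  # ['season', 'score', 'win', 'soccer', 'hockey']
print(by_magnitude == by_value)  # True for this particular q_opt
```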


### **C.2: The 3 Closest Documents to q_opt**

In [15]:
def cosine_similarity(v1, v2):
    """Calculate cosine similarity between two vectors"""
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)

    if norm_v1 == 0 or norm_v2 == 0:
        return 0.0

    return dot_product / (norm_v1 * norm_v2)

def find_closest_documents(q_opt, vectors, labels, articles, terms, top_n=3):
    """
    Find the N documents closest to q_opt using cosine similarity
    """
    print("\n" + "=" * 70)
    print(f"TOP {top_n} DOCUMENTS CLOSEST TO q_opt")
    print("=" * 70)

    # Calculate cosine similarity for all documents
    similarities = [(i, cosine_similarity(q_opt, vector))
                    for i, vector in enumerate(vectors)]

    # Sort by similarity (descending)
    similarities.sort(key=lambda x: x[1], reverse=True)

    # Display top N
    results = []
    for rank, (doc_idx, sim) in enumerate(similarities[:top_n], 1):
        label = "RELEVANT" if labels[doc_idx] == 1 else "NON-RELEVANT"
        vector = vectors[doc_idx]
        article = articles[doc_idx]

        print(f"\n{'='*70}")
        print(f"RANK {rank}: Document {doc_idx + 1}")
        print(f"{'='*70}")
        print(f"Title: {article['title']}")
        print(f"Label: {label}")
        print(f"Cosine Similarity to q_opt: {sim:.6f}")
        print(f"\nVector (TF-IDF scores):")
        print(f"{'  Term':<15} {'Score':<10}")
        print("  " + "-" * 25)

        for term, score in zip(terms, vector):
            print(f"  {term:<15} {score:.2f}")

        results.append({
            'Rank': rank,
            'Doc_ID': doc_idx + 1,
            'Title': article['title'],
            'Label': label,
            'Similarity': sim,
            'Vector': vector.tolist()
        })

    return results

In [16]:
# Find 3 closest documents
closest_docs = find_closest_documents(q_opt, vectors, labels, articles, SPORTS_TERMS, top_n=3)


TOP 3 DOCUMENTS CLOSEST TO q_opt

RANK 1: Document 43
Title: 2025 NHL entry draft
Label: RELEVANT
Cosine Similarity to q_opt: 0.844128

Vector (TF-IDF scores):
  Term          Score     
  -------------------------
  team            2.72
  coach           0.00
  hockey          5.70
  baseball        0.00
  soccer          0.00
  penalty         0.00
  score           5.13
  win             3.65
  loss            0.00
  season          4.68

RANK 2: Document 22
Title: Major League Soccer
Label: NON-RELEVANT
Cosine Similarity to q_opt: 0.831768

Vector (TF-IDF scores):
  Term          Score     
  -------------------------
  team            3.06
  coach           0.00
  hockey          2.52
  baseball        4.58
  soccer          3.83
  penalty         5.72
  score           5.74
  win             2.04
  loss            4.48
  season          4.25

RANK 3: Document 14
Title: Association football
Label: NON-RELEVANT
Cosine Similarity to q_opt: 0.824191

Vector (TF-IDF scores):
  Term  

In [17]:
# Create DataFrame for the 3 closest documents
print("\n" + "=" * 70)
print("3 CLOSEST DOCUMENTS - SUMMARY TABLE")
print("=" * 70)

df_closest = pd.DataFrame([{
    'Rank': doc['Rank'],
    'Doc_ID': doc['Doc_ID'],
    'Title': doc['Title'][:50],
    'Label': doc['Label'],
    'Similarity': doc['Similarity']
} for doc in closest_docs])

display(df_closest)

# Create detailed vectors DataFrame
print("\n" + "=" * 70)
print("3 CLOSEST DOCUMENTS - DETAILED VECTORS")
print("=" * 70)

vectors_data = []
for doc in closest_docs:
    row = {'Rank': doc['Rank'], 'Doc_ID': doc['Doc_ID']}
    for i, term in enumerate(SPORTS_TERMS):
        row[term] = doc['Vector'][i]
    vectors_data.append(row)

df_vectors = pd.DataFrame(vectors_data)
display(df_vectors)


3 CLOSEST DOCUMENTS - SUMMARY TABLE


| | Rank | Doc_ID | Title | Label | Similarity |
|---|---|---|---|---|---|
| 0 | 1 | 43 | 2025 NHL entry draft | RELEVANT | 0.844128 |
| 1 | 2 | 22 | Major League Soccer | NON-RELEVANT | 0.831768 |
| 2 | 3 | 14 | Association football | NON-RELEVANT | 0.824191 |



3 CLOSEST DOCUMENTS - DETAILED VECTORS


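| | Rank | Doc_ID | team | coach | hockey | baseball | soccer | penalty | score | win | loss | season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 43 | 2.72 | 0.0 | 5.7 | 0.0 | 0.0 | 0.0 | 5.13 | 3.65 | 0.0 | 4.68 |
| 1 | 2 | 22 | 3.06 | 0.0 | 2.52 | 4.58 | 3.83 | 5.72 | 5.74 | 2.04 | 4.48 | 4.25 |
| 2 | 3 | 14 | 2.01 | 3.56 | 0.0 | 0.0 | 5.71 | 5.14 | 3.14 | 4.79 | 4.92 | 5.13 |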
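
To make the similarity figures concrete, the top-ranked score can be recomputed by hand from the q_opt vector and the Document 43 row above. This is a pure-Python sketch of the same cosine formula used in the notebook, with values copied from the recorded output; it should land close to the reported 0.844128.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

q_opt = [1.076, -0.09325, 1.5325, 0.4585, 2.07475, 0.8205, 2.719,
         2.48175, 1.24725, 3.48975]
doc_43 = [2.72, 0.0, 5.70, 0.0, 0.0, 0.0, 5.13, 3.65, 0.0, 4.68]

sim = cosine_similarity(q_opt, doc_43)
print(round(sim, 3))  # 0.844
```

Because cosine similarity ignores vector length, a document with only a few non-zero terms can still rank highly if those terms line up with the largest components of q_opt.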


---
## **Complete Analysis - All Documents Ranked**


In [18]:
def create_complete_dataframe(vectors, labels, articles, terms, q_opt):
    """Create complete DataFrame with all documents"""
    data = []

    for i, (vector, label) in enumerate(zip(vectors, labels)):
        row = {
            'Doc_ID': i + 1,
            'Title': articles[i]['title'][:50],
            'Label': 'RELEVANT' if label == 1 else 'NON-RELEVANT',
            'Similarity_to_qopt': cosine_similarity(q_opt, vector)
        }

        # Add TF-IDF scores for each term
        for term, score in zip(terms, vector):
            row[term] = score

        data.append(row)

    df = pd.DataFrame(data)
    df = df.sort_values('Similarity_to_qopt', ascending=False).reset_index(drop=True)
    df.insert(0, 'Rank', range(1, len(df) + 1))

    return df

In [19]:
# Create complete DataFrame
print("\n" + "=" * 70)
print("COMPLETE ANALYSIS - ALL DOCUMENTS RANKED BY SIMILARITY")
print("=" * 70)

df_complete = create_complete_dataframe(vectors, labels, articles, SPORTS_TERMS, q_opt)

print("\nTop 15 documents:")
display(df_complete.head(15))

print("\nStatistics:")
print(f"  Total documents: {len(df_complete)}")
print(f"  Relevant in top 10: {df_complete.head(10)['Label'].value_counts().get('RELEVANT', 0)}")
print(f"  Average similarity (all): {df_complete['Similarity_to_qopt'].mean():.4f}")
print(f"  Average similarity (relevant): {df_complete[df_complete['Label']=='RELEVANT']['Similarity_to_qopt'].mean():.4f}")
print(f"  Average similarity (non-relevant): {df_complete[df_complete['Label']=='NON-RELEVANT']['Similarity_to_qopt'].mean():.4f}")


COMPLETE ANALYSIS - ALL DOCUMENTS RANKED BY SIMILARITY

Top 15 documents:


| | Rank | Doc_ID | Title | Label | Similarity_to_qopt | team | coach | hockey | baseball | soccer | penalty | score | win | loss | season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 43 | 2025 NHL entry draft | RELEVANT | 0.844128 | 2.72 | 0.0 | 5.7 | 0.0 | 0.0 | 0.0 | 5.13 | 3.65 | 0.0 | 4.68 |
| 1 | 2 | 22 | Major League Soccer | NON-RELEVANT | 0.831768 | 3.06 | 0.0 | 2.52 | 4.58 | 3.83 | 5.72 | 5.74 | 2.04 | 4.48 | 4.25 |
| 2 | 3 | 14 | Association football | NON-RELEVANT | 0.824191 | 2.01 | 3.56 | 0.0 | 0.0 | 5.71 | 5.14 | 3.14 | 4.79 | 4.92 | 5.13 |
| 3 | 4 | 12 | Wrestling | NON-RELEVANT | 0.817417 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.01 | 2.95 | 0.0 | 3.63 |
| 4 | 5 | 3 | Baseball | RELEVANT | 0.808052 | 2.13 | 3.32 | 0.0 | 4.06 | 3.11 | 3.94 | 4.16 | 4.89 | 5.53 | 4.3 |
| 5 | 6 | 34 | Super Bowl LIX | NON-RELEVANT | 0.798319 | 2.18 | 3.33 | 0.0 | 0.0 | 0.0 | 2.52 | 5.92 | 2.65 | 3.77 | 4.82 |
| 6 | 7 | 18 | Pro Evolution Soccer | RELEVANT | 0.773314 | 5.68 | 0.0 | 0.0 | 0.0 | 4.12 | 2.23 | 4.03 | 0.0 | 0.0 | 5.41 |
| 7 | 8 | 35 | College football | RELEVANT | 0.770511 | 4.24 | 2.45 | 5.78 | 4.76 | 2.6 | 2.14 | 3.48 | 4.21 | 3.72 | 2.17 |
| 8 | 9 | 13 | List of career achievements by LeBron James | RELEVANT | 0.769865 | 3.93 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.46 | 5.61 | 0.0 | 2.66 |
| 9 | 10 | 4 | MLB.com | NON-RELEVANT | 0.767931 | 2.97 | 0.0 | 0.0 | 3.89 | 0.0 | 0.0 | 3.63 | 2.38 | 0.0 | 4.64 |



Statistics:
  Total documents: 50
  Relevant in top 10: 5
  Average similarity (all): 0.4959
  Average similarity (relevant): 0.5652
  Average similarity (non-relevant): 0.4785
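
The "Relevant in top 10" statistic corresponds to what is commonly called precision at k. A minimal sketch follows, using the labels of the ten ranked rows shown above; the function name `precision_at_k` is introduced here for illustration.

```python
# Labels of ranks 1..10, copied from the ranked table above
ranked_labels = ['RELEVANT', 'NON-RELEVANT', 'NON-RELEVANT', 'NON-RELEVANT',
                 'RELEVANT', 'NON-RELEVANT', 'RELEVANT', 'RELEVANT',
                 'RELEVANT', 'NON-RELEVANT']

def precision_at_k(labels, k):
    """Fraction of the first k ranked labels that are RELEVANT."""
    top = labels[:k]
    if not top:
        return 0.0
    return sum(1 for label in top if label == 'RELEVANT') / len(top)

print(precision_at_k(ranked_labels, 10))  # 0.5
print(precision_at_k(ranked_labels, 3))
```

Because q_opt is computed from these same labels, the relevant documents are pulled toward it by construction, so this figure should not be read as a measure of retrieval quality on unseen documents.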


---
## **Save Results to Google Drive**


In [20]:
# Save complete analysis
csv_file = f"{output_path}/rocchio_complete_analysis.csv"
df_complete.to_csv(csv_file, index=False)
print(f"✓ Saved complete analysis: {csv_file}")

# Save top 5 features
features_file = f"{output_path}/top_5_features.csv"
df_features.to_csv(features_file, index=False)
print(f"✓ Saved top 5 features: {features_file}")

# Save 3 closest documents
closest_file = f"{output_path}/3_closest_documents.csv"
df_closest.to_csv(closest_file, index=False)
print(f"✓ Saved 3 closest documents: {closest_file}")

# Save detailed text summary
summary_file = f"{output_path}/rocchio_summary.txt"
with open(summary_file, 'w', encoding='utf-8') as f:
    f.write("ROCCHIO ALGORITHM - ASSIGNMENT RESULTS\n")
    f.write("=" * 70 + "\n\n")

    f.write("CONFIGURATION:\n")
    f.write(f"  Number of documents: {len(articles)}\n")
    f.write(f"  Vocabulary: {', '.join(SPORTS_TERMS)}\n")
    f.write(f"  Relevant documents: {np.sum(labels)} ({np.sum(labels)/len(labels)*100:.1f}%)\n\n")

    f.write("q_opt VECTOR:\n")
    for term, score in zip(SPORTS_TERMS, q_opt):
        f.write(f"  {term:<12}: {score:.4f}\n")

    f.write(f"\n\nTOP 5 MOST SIGNIFICANT FEATURES:\n")
    f.write("-" * 40 + "\n")
    for i, (term, score) in enumerate(top_features, 1):
        f.write(f"{i}. {term}: {score:.4f}\n")

    f.write("\n\nTOP 3 CLOSEST DOCUMENTS:\n")
    f.write("-" * 40 + "\n")
    for doc in closest_docs:
        f.write(f"\nRank {doc['Rank']}: {doc['Title']}\n")
        f.write(f"  Label: {doc['Label']}\n")
        f.write(f"  Similarity: {doc['Similarity']:.6f}\n")
        f.write(f"  Vector: {doc['Vector']}\n")

print(f"✓ Saved summary: {summary_file}")

print("\n" + "=" * 70)
print("✓ ALL RESULTS SAVED TO GOOGLE DRIVE")
print("=" * 70)
print(f"\n📁 Location: {output_path}")
print("\nFiles created:")
print("  1. rocchio_complete_analysis.csv")
print("  2. top_5_features.csv")
print("  3. 3_closest_documents.csv")
print("  4. rocchio_summary.txt")

✓ Saved complete analysis: /content/drive/MyDrive/Colab Notebooks/Information Retreival/Rocchio/rocchio_complete_analysis.csv
✓ Saved top 5 features: /content/drive/MyDrive/Colab Notebooks/Information Retreival/Rocchio/top_5_features.csv
✓ Saved 3 closest documents: /content/drive/MyDrive/Colab Notebooks/Information Retreival/Rocchio/3_closest_documents.csv
✓ Saved summary: /content/drive/MyDrive/Colab Notebooks/Information Retreival/Rocchio/rocchio_summary.txt

✓ ALL RESULTS SAVED TO GOOGLE DRIVE

📁 Location: /content/drive/MyDrive/Colab Notebooks/Information Retreival/Rocchio

Files created:
  1. rocchio_complete_analysis.csv
  2. top_5_features.csv
  3. 3_closest_documents.csv
  4. rocchio_summary.txt


---

### **Summary:**

✅ **Part A:** Created 50 documents with TF-IDF vectors (10 terms each)  
✅ **Part A:** Assigned 20% RELEVANT, 80% NON-RELEVANT labels  
✅ **Part B:** Calculated q_opt using Rocchio: `q_opt = 2*μ_R - μ_NR`  
✅ **Part C.1:** Displayed top 5 most significant features  
✅ **Part C.2:** Displayed 3 closest documents with vectors and labels  

**All results were saved to the Google Drive folder.**