
# Topic: Quality of Retrieval by Major Search Engines

# A. INTRODUCTION

This report assesses the retrieval effectiveness of two major search engines, Google and Bing. The goal is to compare their performance in retrieving relevant information for different queries by calculating precision and recall. This evaluation is essential for understanding which search engine delivers more accurate and comprehensive results. The detailed calculations can be found in the accompanying spreadsheet submitted with this report.
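
As a reminder of what the two measures capture, the following minimal sketch computes precision (the share of retrieved links that are relevant) and recall (the share of relevant links that were retrieved) for a single result list. The function name and the sample identifiers are illustrative assumptions, not part of the accompanying spreadsheet.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one result list.

    retrieved: links returned by the engine, in rank order.
    relevant:  set of links judged relevant for the information need.
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: 4 results returned, 5 relevant documents in the pool.
retrieved = ["a", "b", "c", "d"]
relevant = {"a", "c", "e", "f", "g"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f}, recall={recall:.2f}")
# precision=0.50, recall=0.40
```

Two of the four retrieved links are relevant (precision 0.5), and two of the five relevant links were found (recall 0.4).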

# B. METHODOLOGY

**1. Information Needs.**
*The following information needs were selected for this evaluation:*


1.   History of La Liga
2.   What is the difference between iPhone and Samsung Galaxy?



**2. Selected Search Engines.**
The search engines chosen for this evaluation are Google and Bing.

# Step 1 - Pooling Process

**1. Description.**
*For each information need, the query was run on both Google and Bing, and the top 30 natural (organic) search results from each of the two search engines were collected. Duplicates and dead links were then removed, leaving a set of unique links that forms the relevant link pool.*
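
The pooling step described above could be sketched as follows. The report does not state how duplicates were detected or how dead links were identified, so the URL normalization and the `dead_links` argument here are assumptions made for illustration; `normalize_url` and `build_pool` are hypothetical helper names.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def build_pool(google_results, bing_results, dead_links=frozenset(), top_n=30):
    """Merge the top_n results of both engines into one de-duplicated pool."""
    dead = {normalize_url(u) for u in dead_links}
    pool = []
    seen = set()
    for url in list(google_results[:top_n]) + list(bing_results[:top_n]):
        key = normalize_url(url)
        if key in seen or key in dead:
            continue  # skip duplicates and links known to be dead
        seen.add(key)
        pool.append(url)
    return pool


# Hypothetical result lists (assumption: dead links were identified beforehand).
google = ["https://example.org/a", "https://example.org/b/", "https://example.org/c"]
bing = ["https://EXAMPLE.org/b", "https://example.org/d", "https://example.org/c#top"]
dead = {"https://example.org/d"}

pool = build_pool(google, bing, dead)
print(len(pool))  # 3
```

Keeping the first occurrence of each link preserves the order in which the engines returned it, which can be useful when relevance is judged later.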

**2. Results of Pooling.**
The number of unique links obtained for each information need is computed in Step 2 and listed in its output table.

# Importing Python libraries

In [32]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from openpyxl import load_workbook
from openpyxl.drawing.image import Image
from google.colab import files
import os

# Importing the dataset

In [33]:
# Ensure the output directory exists
os.makedirs('/mnt/data', exist_ok=True)

# Load the pooled search results from the published Google Sheet (CSV export)
csv_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vRtzRMSXL-2h3M-7eJPMp3D1BP9f49Jgd7O6d5F9UiobMIBw8FWbYdtYlg-pj-s6zcpbzpS4zHm6oye/pub?output=csv"
print("Loading data from CSV...")
df = pd.read_csv(csv_url)

Loading data from CSV...
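
Before the data is used further, it may be worth confirming that the sheet has the shape the later cells expect. The sketch below assumes only the `Query` and `Links` column names that the notebook itself uses; `check_pool_frame` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd

REQUIRED_COLUMNS = {"Query", "Links"}  # column names the later cells rely on


def check_pool_frame(frame):
    """Raise ValueError if a required column is missing; return a cleaned copy."""
    missing = REQUIRED_COLUMNS - set(frame.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    # Drop rows without a query or a link, then trim stray whitespace.
    cleaned = frame.dropna(subset=["Query", "Links"]).copy()
    cleaned["Query"] = cleaned["Query"].str.strip()
    cleaned["Links"] = cleaned["Links"].str.strip()
    return cleaned


# Hypothetical stand-in for the published sheet.
sample = pd.DataFrame({
    "Query": ["History of La Liga ", "History of La Liga", None],
    "Links": ["https://example.org/1", " https://example.org/2", "https://example.org/3"],
})

checked = check_pool_frame(sample)
print(len(checked))  # 2
```

In the notebook this could be applied as `df = check_pool_frame(df)` right after `pd.read_csv`, so that a renamed or missing column fails early with a clear message.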


# Adding queries manually

In [34]:
# Queries and links to add by hand (the example.com URLs are placeholders)
new_queries = ["History of La Liga", "What is the difference between iPhone and Samsung Galaxy?"]
new_links = ["https://example.com/laliga-history", "https://example.com/iphone-vs-samsung"]

# Append the new query/link pairs to the DataFrame
df_new = pd.DataFrame({
    "Query": new_queries,
    "Links": new_links
})
df = pd.concat([df, df_new], ignore_index=True)

# Step 2 - Calculate Unique Links and Relevant Link Pool

In [40]:
# Normalize the query text first so that case or whitespace variants of the
# same query are pooled together instead of being counted separately
df['Query'] = df['Query'].str.strip().str.lower()

query_data = []

for query in df['Query'].unique():
    # All links collected for this query (both engines, duplicates included)
    query_links = df.loc[df['Query'] == query, 'Links'].tolist()

    # Total result pool: every collected link, duplicates included
    total_result_pool = len(query_links)

    # Relevant link pool: unique links only (duplicates eliminated)
    relevant_link_pool = len(set(query_links))

    query_data.append({
        'Query': query,
        'Total Result Pool': total_result_pool,
        'Relevant Link Pool': relevant_link_pool
    })

# Convert to DataFrame; queries are already normalized, so each appears once
query_df = pd.DataFrame(query_data)
query_df_unique = query_df.drop_duplicates(subset=['Query'], keep='first')

# Save to Excel
output_file = "/mnt/data/precision_recall_report.xlsx"

with pd.ExcelWriter(output_file, engine='openpyxl') as writer:
    # Step 1: Save query data to Excel
    query_df_unique.to_excel(writer, sheet_name="Step 1 - Query Data", index=False)

print(f"Excel report saved to '{output_file}'.")
files.download(output_file)  # Offer the file for download in Colab

# Show the first rows of the cleaned query table
query_df_unique.head(5)

Excel report saved to '/mnt/data/precision_recall_report.xlsx'.


| | Query | Total Result Pool | Relevant Link Pool |
|---|---|---|---|
| 0 | history of la liga | 60 | 30 |
| 1 | what is the difference between iphone and sams... | 62 | 56 |
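
The same two counts can also be obtained without an explicit loop. The following sketch, on a made-up miniature of the pooled data, uses `groupby` with the `size` and `nunique` aggregations; it is an alternative formulation, not the code that produced the table above.

```python
import pandas as pd

# Hypothetical miniature of the pooled data, using the same column names.
pool_df = pd.DataFrame({
    "Query": ["q1", "q1", "q1", "Q1 ", "q2", "q2"],
    "Links": ["u1", "u2", "u1", "u3", "u4", "u4"],
})

# Normalize the query text first so case and whitespace variants share one group.
pool_df["Query"] = pool_df["Query"].str.strip().str.lower()

summary = (
    pool_df.groupby("Query")["Links"]
    .agg(["size", "nunique"])  # all rows per query, and distinct links per query
    .rename(columns={"size": "Total Result Pool", "nunique": "Relevant Link Pool"})
    .reset_index()
)
print(summary)
```

Here `q1` has four collected links of which three are distinct, and `q2` has two collected links that are the same URL.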


# Step 3 - Precision and Recall Calculation

In [38]:
# Rebuild the query table from the Step 2 results
query_df = pd.DataFrame(query_data)

# Remove any duplicate queries (case-insensitive)
query_df['Query'] = query_df['Query'].str.strip().str.lower()  # Normalize queries
query_df_unique = query_df.drop_duplicates(subset=['Query'], keep='first')

# Start a fresh workbook; this overwrites the file written in Step 2
output_file = "/mnt/data/precision_recall_report.xlsx"

with pd.ExcelWriter(output_file, engine='openpyxl') as writer:
    # Save the query table to the first sheet, named "Sheet 1"
    query_df_unique.to_excel(writer, sheet_name="Sheet 1", index=False)

# Interpolated precision-recall data (Google vs Bing)
# Standard recall levels 0.1, 0.2, ..., 1.0; linspace avoids the floating-point
# endpoint ambiguity of np.arange with a fractional step
standard_recall = np.linspace(0.1, 1.0, 10)
np.random.seed(42)

# Placeholder data: random values stand in for the interpolated precision of
# four queries per engine (one column per query); they are not measured results
google_data = np.random.uniform(0.6, 1.0, (len(standard_recall), 4))
bing_data = np.random.uniform(0.5, 0.9, (len(standard_recall), 4))

# Mean precision across the four queries at each standard recall level
google_average = google_data.mean(axis=1)
bing_average = bing_data.mean(axis=1)

# Save Precision-Recall Data to Excel
precision_recall_df = pd.DataFrame({
    'Standard Recall': standard_recall,
    'Google, interpolated precision (Q1)': google_data[:, 0],
    'Google, interpolated precision (Q2)': google_data[:, 1],
    'Google, interpolated precision (Q3)': google_data[:, 2],
    'Google, interpolated precision (Q4)': google_data[:, 3],
    'Google, Average Precision': google_average,
    'Bing, interpolated precision (Q1)': bing_data[:, 0],
    'Bing, interpolated precision (Q2)': bing_data[:, 1],
    'Bing, interpolated precision (Q3)': bing_data[:, 2],
    'Bing, interpolated precision (Q4)': bing_data[:, 3],
    'Bing, Average Precision': bing_average
})

with pd.ExcelWriter(output_file, engine='openpyxl', mode='a') as writer:
    # Save precision-recall data to the second sheet
    precision_recall_df.to_excel(writer, sheet_name="Precision-Recall Data", index=False)
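
Because the cell above fills the table with random placeholder values, it may help to see how interpolated precision at the standard recall levels could be derived from actual relevance judgments. The sketch below is one possible pure-Python approach; the function names and the judgment list are hypothetical, and it assumes binary relevance and a known relevant pool size for the query.

```python
def precision_recall_points(ranked_relevance, total_relevant):
    """Return a (recall, precision) pair after each rank of a result list.

    ranked_relevance: list of 0/1 judgments in rank order.
    total_relevant:   size of the relevant pool for the query.
    """
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        hits += is_relevant
        points.append((hits / total_relevant, hits / rank))
    return points


def interpolate(points, recall_levels):
    """Interpolated precision: the best precision at any recall >= the level."""
    result = []
    for level in recall_levels:
        # Small tolerance so that a recall of exactly 0.5 counts for level 0.5.
        candidates = [p for r, p in points if r >= level - 1e-9]
        result.append(max(candidates) if candidates else 0.0)
    return result


# Hypothetical judgments for one query: 5 results, 4 relevant documents in the pool.
judgments = [1, 0, 1, 1, 0]
levels = [round(0.1 * i, 1) for i in range(1, 11)]  # 0.1, 0.2, ..., 1.0

interpolated = interpolate(precision_recall_points(judgments, 4), levels)
print(interpolated)
# [1.0, 1.0, 0.75, 0.75, 0.75, 0.75, 0.75, 0.0, 0.0, 0.0]
```

The list never reaches a recall above 0.75 (three of four relevant documents found), so the interpolated precision is 0.0 for the levels 0.8 to 1.0. Under this approach, each query's list would fill one column of the table and the columns would then be averaged per recall level.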

# Step 4 - Interpolated Precision-Recall Curves

In [39]:
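# Interpolate precision: at each recall level, take the maximum precision at
# that level or any higher one (running maximum computed from right to left)
google_interpolated = np.maximum.accumulate(google_average[::-1])[::-1]
bing_interpolated = np.maximum.accumulate(bing_average[::-1])[::-1]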

# Generate Comparison Data
comparison_data = {
    'Recall': standard_recall,
    'Google Precision': google_interpolated,
    'Bing Precision': bing_interpolated
}
comparison_df = pd.DataFrame(comparison_data)

# Plot the curves
plt.figure(figsize=(8, 6))
plt.plot(comparison_df['Recall'], comparison_df['Google Precision'], label="Google", color='blue', marker='o')
plt.plot(comparison_df['Recall'], comparison_df['Bing Precision'], label="Bing", color='red', marker='x')
plt.title('Interpolated Precision-Recall Curve Comparison (Google vs Bing)')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend(loc='best')
plt.grid(True)

# Save the plot as an image file
graph_image_path = '/mnt/data/precision_recall_curve.png'
plt.savefig(graph_image_path)
plt.close()

# Load the workbook and insert the image into the Excel sheet
wb = load_workbook(output_file)
ws = wb["Precision-Recall Data"]

# Add the image to the right of the data table (columns A-K) so it does not cover the values
img = Image(graph_image_path)
ws.add_image(img, 'M2')

# Save the workbook with the image inserted
wb.save(output_file)

# Download the report and graph
files.download(output_file)
files.download(graph_image_path)
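
The reverse running-maximum idiom used for the interpolation can be checked on a small array. The values below are made up for illustration and are not measured precision.

```python
import numpy as np

# Hypothetical mean precision at recall 0.1, 0.2, ..., 0.5 (not measured data).
precision = np.array([0.9, 0.6, 0.7, 0.4, 0.5])

# Reverse, take the running maximum, reverse back: each entry becomes the
# maximum precision at that recall level or any higher one.
interpolated = np.maximum.accumulate(precision[::-1])[::-1]
print(interpolated)
# [0.9 0.7 0.7 0.5 0.5]
```

The dip at the second level (0.6) is lifted to 0.7 because a higher recall level achieves that precision, which is why the resulting curve never increases from left to right.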
