# **IMPORTANT NOTE**

1. In the file name of this notebook, "`CENG3530-Spring2023-FinalProject-STUDENTNO-NAME.ipynb`", <br> replace "`STUDENTNO`" with **your student number** <br>and "`NAME`" with **your full name**,<br> e.g. `CENG3530-Spring2023-FinalProject-000000001-BekirTanerDincer.ipynb`

# CENG3530 2023 Spring Semester Final Project

This project requires you to index the Milliyet100K test collection, execute the queries provided with the test collection, and calculate the MAP (mean average precision) score for the Apache Solr implementation of BM25.

The processes you need to implement and execute are outlined in the sections of this notebook.

# Install & Run Apache Solr

In [None]:
# Download the binary solr file from the apache.org
!wget -O solr-9.2.1.tgz "https://www.apache.org/dyn/closer.lua/solr/solr/9.2.1/solr-9.2.1.tgz?action=download"

In [None]:
# Extract the binary solr files
!tar -xzf solr-9.2.1.tgz

## Launch Solr

In [None]:
!cd solr-9.2.1/ && bin/solr start -c -force

RNG might not work properly. To check for the amount of available entropy, use 'cat /proc/sys/kernel/random/entropy_avail'.

Waiting up to 180 seconds to see Solr running on port 8983 [|]   [/]   [-]   [\]   [|]   [/]   [-]   [\]   [|]   [/]  
Started Solr server on port 8983 (pid=458). Happy searching!

    

## Check processes

In [None]:
!lsof -iTCP -sTCP:LISTEN

COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
node        7 root   21u  IPv6  19425      0t0  TCP *:8080 (LISTEN)
kernel_ma  33 root    3u  IPv4  19567      0t0  TCP c329e01bcda7:6000 (LISTEN)
colab-fil  61 root    3u  IPv4  19427      0t0  TCP localhost:3453 (LISTEN)
jupyter-n  82 root    7u  IPv4  20751      0t0  TCP c329e01bcda7:9000 (LISTEN)
python3   244 root   22u  IPv4  22212      0t0  TCP localhost:42773 (LISTEN)
python3   271 root    3u  IPv4  22503      0t0  TCP localhost:20054 (LISTEN)
python3   271 root    4u  IPv4  22504      0t0  TCP localhost:44539 (LISTEN)
java      458 root   46u  IPv4  26344      0t0  TCP localhost:7983 (LISTEN)
java      458 root   47u  IPv4  26345      0t0  TCP localhost:8983 (LISTEN)
java      458 root  162u  IPv4  28805      0t0  TCP localhost:9983 (LISTEN)
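
The same check can also be done from Python. The following is a minimal sketch, not part of the original notebook: the helper name `is_port_open` is hypothetical, and it only tests whether something accepts TCP connections on the given port, not that the listener is actually Solr.

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Once Solr has started, `is_port_open("localhost", 8983)` should return `True`, matching the `localhost:8983 (LISTEN)` entry in the listing above.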


## Access the Solr Admin UI

In [None]:
!npm install -g localtunnel

In [None]:
get_ipython().system_raw('lt --port 8983 --subdomain ipysolr >> url.txt 2>&1 &')

In [None]:
!cat url.txt

your url is: https://ipysolr.loca.lt
your url is: https://ipysolr.loca.lt
your url is: https://ipysolr.loca.lt


In [None]:
import requests

def get_public_ip():
    response = requests.get('https://ipinfo.io/ip')
    if response.status_code == 200:
        return response.text.strip()
    else:
        return None

public_ip = get_public_ip()
print(f"Public IP: {public_ip}")


Public IP: 34.86.244.83


# Download & Extract Milliyet Test Collection

In [None]:
!gdown 1UtpaOIl1okLAzRAEnjDJn6Hp_fDGO6Az

In [None]:
!unzip /content/MilliyetCollection100K.zip -d /content

# Helper Functions

## Install Zstandard library
The files in the collection directory are compressed using the **Zstandard** compression algorithm.

https://facebook.github.io/zstd/

In [None]:
!pip install zstandard

## Install tqdm

In [None]:
!pip install tqdm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Index Milliyet test collection

## Read documents into list

In [None]:
import os
import zstandard as zstd
from bs4 import BeautifulSoup
from tqdm import tqdm

def list_files(dir_path):
    with os.scandir(dir_path) as entries:
        for entry in entries:
            if entry.is_file():
                yield os.path.join(dir_path, entry.name)

def documentReader(collection_path):
    dctx = zstd.ZstdDecompressor()
    for zstd_filename in tqdm(list_files(collection_path)):
        with open(zstd_filename, 'rb') as zstd_file:
            decompressed_data = dctx.decompress(zstd_file.read())
            file_content = decompressed_data.decode('utf-8')
            soup = BeautifulSoup(file_content, 'html.parser')
            docs = soup.find_all('doc')
            for doc in docs:
                docID = doc.find('docid').text
                text = doc.find('text').text.strip()
                yield docID, text

In [None]:
collection_path = "/content/MilliyetCollection100K"

documents = [] # [(docid, content)]
for docID, content in documentReader(collection_path):
  documents.append((docID, content))

50it [00:24,  2.05it/s]


## Prepare Solr for document indexing

### Create Collection

In [None]:
%%shell
# Create the milliyet collection
curl --request POST \
--url http://localhost:8983/api/collections \
--header 'Content-Type: application/json' \
--data '{
  "create": {
    "name": "milliyet",
    "numShards": 1,
    "replicationFactor": 1
  }
}'

### Add custom field(s) for milliyet collection

In [None]:
%%shell
# Create the schema for the collection
curl --request POST \
  --url http://localhost:8983/api/collections/milliyet/schema \
  --header 'Content-Type: application/json' \
  --data '{
  "add-field": [
    {"name": "content", "type": "text_general", "multiValued": false}
  ]
}'

## Index documents in document list

In [None]:
curl_statement = []
curl_statement.append("curl --request POST")
curl_statement.append("--url 'http://localhost:8983/api/collections/milliyet/update'")
curl_statement.append("--header 'Content-Type: application/json'")
for docID, content in tqdm(documents):
  data = """ '{"id" : "%s", "content" :"%s"}' """ % (docID, content.replace('"',' ').replace("'",' '))
  _curl_command_ = " ".join(curl_statement) +" --data " + data
#  print(_curl_command_)
  get_ipython().system_raw(_curl_command_)


100%|██████████| 100000/100000 [33:52<00:00, 49.19it/s]
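
Issuing one `curl` call per document is slow, and stripping quotes by hand changes the text. As a hedged alternative, the sketch below serializes documents with `json.dumps` and groups them into batches. The helper names `make_batches` and `post_batch`, the batch size of 1000, and the use of the `/solr/milliyet/update` endpoint with a JSON array body are assumptions for illustration, not the notebook's method.

```python
import json
from urllib.request import Request, urlopen

def make_batches(documents, batch_size=1000):
    """Yield JSON array strings, each holding up to batch_size documents.

    documents is a list of (docid, content) tuples, as built earlier in the notebook.
    """
    for start in range(0, len(documents), batch_size):
        chunk = documents[start:start + batch_size]
        payload = [{"id": doc_id, "content": content} for doc_id, content in chunk]
        # json.dumps escapes quotes, backslashes and newlines, so the text is kept intact
        yield json.dumps(payload, ensure_ascii=False)

def post_batch(body, url="http://localhost:8983/solr/milliyet/update"):
    """POST one JSON batch to Solr and return the HTTP status (assumed endpoint)."""
    request = Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request) as response:
        return response.status
```

A loop such as `for body in make_batches(documents): post_batch(body)` would then send the whole collection in about 100 requests instead of 100,000.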


## Set refresh/auto-commit interval

In [None]:
%%shell
curl -X POST -H 'Content-type: application/json' \
--url http://localhost:8983/api/collections/milliyet/config \
--data '{"set-property":{"updateHandler.autoCommit.maxTime":3000}}'

# Run queries in batch form and calculate MAP

In [None]:
# Download Queries
!gdown 1DTstA3IbtuGSKOuAGQoJRCR93_21tTiC

In [None]:
# Download qrels
!gdown 1QSe4Q2Q63b4K0_91ZBQny3qniptWqWIc

## Example Search

In [None]:
from urllib.request import urlopen
import json

connection = urlopen('http://localhost:8983/solr/milliyet/select?df=content&wt=json&q=ankara')
response = json.load(connection)

print(response['response']['numFound'], "documents found.")

# Print the docid of each document.
for rank, document in enumerate(response['response']['docs']):
    print(f'{rank+1}\t {document["id"]}')

13577 documents found.
1	 Milliyet_0105_v00_207254
2	 Milliyet_0105_v00_35173
3	 Milliyet_0105_v00_135568
4	 Milliyet_0105_v00_233191
5	 Milliyet_0105_v00_1671
6	 Milliyet_0105_v00_391439
7	 Milliyet_0105_v00_211529
8	 Milliyet_0105_v00_35150
9	 Milliyet_0105_v00_195405
10	 Milliyet_0105_v00_129343


## Read test queries & create result lists

### Read queries

In [None]:
query_path = "/content/MilliyetCollectionQueries.txt"

with open(query_path, "r") as qfile:
  qlines = qfile.readlines()

queries = [line.replace("\n","").split("\t")[:2] for line in qlines[1:]]
print(f"Number of queries: {len(queries)}")
print(queries)

Number of queries: 72
[['235', 'Kuş Gribi'], ['238', 'Kıbrıs Sorunu'], ['241', 'Üniversiteye giriş sınavı'], ['243', 'Tsunami'], ['244', 'Mavi Akım Doğalgaz Projesi'], ['258', 'Deprem Tedbir Önlem'], ['265', 'Türkiye PKK çatışmaları'], ['270', 'Film Festivalleri'], ['271', 'Bedelli askerlik uygulaması'], ['278', 'Stresle Başa Çıkma Yolları '], ['282', 'Şampiyonlar Ligi'], ['283', '17 Ağustos Depremi'], ['284', "Türkiye'de internet kullanımı"], ['288', 'Amerika Irak işgal demokrasi petrol'], ['289', "Türkiye'de futbol şikesi"], ['294', 'Fadıl Akgündüz'], ['295', 'İşsizlik sorunu'], ['296', '2005 F1 Türkiye Grand Prix'], ['298', 'Ekonomik kriz'], ['300', 'Nuri Bilge Ceylan'], ['301', "Türkiye'de meydana gelen depremler"], ['302', 'ABD-Irak Savaşı'], ['304', "Hakan Şükür'ün milli takım kadrosuna alınmaması"], ['305', 'Avrupa Birliği, Türkiye ve insan hakları'], ['306', 'Turizm'], ['307', 'Türkiye’deki sokak çocukları'], ['308', 'Türk filmleri ve sineması'], ['311', 'Pakistan Depremi'], ['

### Submit each query and create result lists

**Challenge**

By default, Solr returns at most 10 documents for any query, despite the fact that it has found many more.

Configure the URL so that the result list contains at most 100 documents instead of 10.

In [None]:
from urllib.request import urlopen
from urllib.parse import quote
import json
# queries = [[qid, topic], ...]
ranked_lists = {} # {query_id:ranked_list, ...}
for query_id, topic in queries:
  url = 'http://localhost:8983/solr/milliyet/select?df=content&wt=json&q=%s' % quote(topic)
  connection = urlopen(url)
  response = json.load(connection)
  result_list = []
  for document in response['response']['docs']:
    result_list.append(document['id'])
  ranked_lists[query_id] = result_list
  print(f"For query {query_id}, {response['response']['numFound']} documents found, {len(result_list)} documents returned.")
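
For the challenge above, one possible approach is Solr's `rows` query parameter, which sets the maximum number of documents returned. The sketch below builds the request URL with `urlencode`; the helper name `build_select_url` is hypothetical, and the other parameters are copied from the cell above.

```python
from urllib.parse import quote, urlencode

def build_select_url(topic, rows=100,
                     base="http://localhost:8983/solr/milliyet/select"):
    """Build a Solr select URL that asks for at most `rows` documents."""
    params = {"df": "content", "wt": "json", "q": topic, "rows": rows}
    # quote_via=quote percent-encodes spaces as %20, like the quote() call above
    return base + "?" + urlencode(params, quote_via=quote)
```

Replacing the `url = ...` line in the loop with `url = build_select_url(topic)` should yield up to 100 documents per query.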


# Calculate MAP score over all queries
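
As a starting point, here is a minimal sketch of the MAP computation. It assumes the qrels file has already been loaded into a dictionary `qrels` that maps each query id to the set of relevant document ids; the file's actual format is not shown in this notebook, so that loading step is left out. It also assumes that a query with no relevant documents contributes an average precision of 0, which is one convention among several.

```python
def average_precision(ranked_list, relevant):
    """Average precision for one query.

    Sums precision@k over every rank k that holds a relevant document,
    then divides by the total number of relevant documents.
    """
    if not relevant:
        return 0.0  # assumption: no relevant documents means AP = 0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_list, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(ranked_lists, qrels):
    """Mean of the per-query average precision over all queries in ranked_lists.

    ranked_lists: {query_id: [docid, ...]} as built in the previous section.
    qrels: {query_id: set of relevant docids} (assumed structure).
    """
    if not ranked_lists:
        return 0.0
    scores = [
        average_precision(docs, qrels.get(query_id, set()))
        for query_id, docs in ranked_lists.items()
    ]
    return sum(scores) / len(scores)
```

With `ranked_lists` from the previous section, `mean_average_precision(ranked_lists, qrels)` returns a single score between 0 and 1.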