In [1]:
# NOTE: To get files for this assignment # 2, run inverted index 5.0 and get files from there.

# Overview

In this assignment, you will use the index you created in Assignment 1 to rank documents 
and create a search engine. You will implement two different scoring functions and compare 
their results against a baseline ranking produced by expert analysts.

# Running Queries


For this assignment, you will need the following two files:

<font color=red>  </font> <font color=blue> topics.xml (\\sandata\xeon\Maryam Bashir\Information Retrieval\topics.xml) </font> contains the queries you will be testing. 

You should run the queries using the text stored in the <font color=green> query </font> elements. The <font color=green> description </font> elements are only there to clarify the information need which the query is trying to express</font> .


<font color=red>  </font> <font color=blue> corpus.qrel (\\sandata\xeon\Maryam Bashir\Information Retrieval\corpus.qrel)</font> contains the relevance grades from expert assessors. While these grades are not necessarily entirely correct (and defining correctness unambiguously is quite difficult), they are fairly reliable and we will treat them as being correct here. 

The format here is:
<font color=green> topic </font> <font color=green> 0 </font> <font color=green> docid </font> <font color=green> grade </font>

<font color=red> o </font> <font color=green> topic </font> is the ID of the query for which the document was assessed.

<font color=red> o </font> <font color=green> 0 </font> is part of the format and can be ignored.

<font color=red> o </font> <font color=green> docid </font> is the name of one of the documents which you have indexed.

<font color=red> o </font> <font color=green> grade </font> is a value in the set <font color=blue> {-2, 0, 1, 2, 3, 4} </font>, where a higher value means that the document is more relevant to the query. 
The value -2 indicates a spam document, and 0 indicates a non-spam document which is completely non- relevant. 
Most queries do not have any document with a grade of 4, and many queries do not have any document with a grade of 3.
This is a consequence of the specific meaning assigned to these grades here and the manner in which the documents were collected.

This <font color=green> QREL </font> does not have assessments for every  <font color=blue>(query, document) </font> pair. If an assessment is missing, we assume the correct grade for the pair is 0 (non-relevant).

You will write a program which takes the name of a scoring function as a command line argument and which prints a ranked list of documents for all queries found in topics.xml using that scoring function. 

For example:

<font color=red> $ </font>  <font color=green> ./query.py --score TF-IDF </font> 

<font color=blue> 202 clueweb12-0000tw-13-04988 1 0.73 run1 </font> 

<font color=blue> 202 clueweb12-0000tw-13-04901 2 0.33 run1 </font>  

<font color=blue> 202 clueweb12-0000tw-13-04932 3 0.32 run1 </font>  ...

<font color=blue> 214 clueweb12-0000tw-13-05088 1 0.73 run1 </font> 

<font color=blue> 214 clueweb12-0000tw-13-05001 2 0.33 run1 </font> 

<font color=blue> 214 clueweb12-0000tw-13-05032 3 0.32 run1 </font> ...

<font color=blue> 250 clueweb12-0000tw-13-05032 500 0.002 run1 </font>


The output should have one row for each document which your program ranks for each query it runs. 
These lines should have the format:

<font color=green> topic </font> <font color=green> docid </font> <font color=green> rank </font> <font color=green> score </font> <font color=green> run </font>

<font color=red>  </font> <font color=green> topic </font> is the ID of the query for which the document was ranked.

<font color=red>  </font> <font color=green> docid </font> is the document identifier.

<font color=red>  </font> <font color=green> rank </font> is the order in which to present the document to the user. The document with the highest score will be assigned a rank of 1, the second highest a rank of 2, and so on.

<font color=red>  </font> <font color=green> score </font> is the actual score the document obtained for that query.

<font color=red>  </font> <font color=green> run </font> is the name of the run. You can use any value here. It is meant to allow research teams to submit multiple runs for evaluation in competitions such as TREC.

In [477]:
from bs4 import BeautifulSoup
import re
from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize
import random
import os
import operator
import xml.dom.minidom
import numpy as np
import math
#from sets import Set
#from html.parser import HTMLParser

In [438]:
def get_directory_path(mode):
    """
    It takes only path of folder, no file name.
    It only returns the folder which contain all the text file.
    
    Argument:
    #nothing
    
    Returns:
    dp -- directory path which contains all the txt files.
    """
    if (mode == "input"):   
        dp = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input"
    elif (mode == "output"):
        dp = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/out"
    else:
        raise ValueError('Unspecified mode for I/O.')
        dp = None

    return dp

In [439]:
# Function : read_stop_list
def read_text_in_list_form(file_path):
    """
    This function takes the path of stop words file and reads it and returns a list of words.
    
    Argument:
    stop_file_path -- path should be like: path + file name.extension
        "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/stoplist.txt".
    
    Returns:
    lineList -- list of words containg all the stop_words.
    """
    
    lst = [line.rstrip('\n') for line in open(file_path)]
    return lst

In [440]:
stop_word_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/stoplist.txt"
stop_words = read_text_in_list_form(stop_word_file)

In [441]:
# # Function : To open term_ids with terms
# def open_inverted_index(file_name):
#     tokensDict = dict()
#     i = 0
#     file = open(file_name,'r',encoding = "utf-8")
#     for value in file:
#         value = value.split("/")
#         value[1] = value[1].strip("\n")
#         tokensDict[str(value[1])] = str(value[0])
#         i = i + 1
#     file.close()
#     return tokensDict

In [442]:
# def xml_parser(file_name):
#     doc = xml.dom.minidom.parse(file_name)
#     qrys = doc.getElementsByTagName('query')
#     tpcs = doc.getElementsByTagName('topic')
#     queries_list = list()
#     #qrys = mydoc.getElementsByTagName('topic')
#     i = 0
#     for elem in qrys:
#         #print(tpcs[i].attributes['number'].value)
#         queries_list.append(elem.firstChild.data)
#         i = i + 1
#         #print(elem.firstChild.data)
#     #print(queries_list)

## Version 2.0 using dictionary
def xml_parser(file_name):
    doc = xml.dom.minidom.parse(file_name)
    qrys = doc.getElementsByTagName('query')
    tpcs = doc.getElementsByTagName('topic')
    queries = dict()
    #qrys = mydoc.getElementsByTagName('topic')
    i = 0
    for elem in qrys:
        queries[tpcs[i].attributes['number'].value] = elem.firstChild.data
        #queries_list.append(elem.firstChild.data)
        i = i + 1
        #print(elem.firstChild.data)
    #print(queries)
    return queries

In [443]:
file_name = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/topics.xml"
qrys = xml_parser(file_name)
#print(qrys[str(202)])
print(qrys)

{'202': 'uss carl vinson', '214': 'capital gains tax rate', '216': 'nicolas cage movies', '221': 'electoral college 2008 results', '227': 'i will survive lyrics', '230': "world's biggest dog", '234': 'dark chocolate health benefits', '243': 'afghanistan flag', '246': 'civil war battles in South Carolina', '250': 'ford edge problems'}


In [444]:
# def read_relevance_judgements(file_name):
#     file = open(file_name,'r',encoding = "utf-8")
#     for line in file:
#         splited = line.split()
#         #print(type(x))
#         #print(x)
    
        
        
#         #print(file_line)
    
    
# #     print(file_line)
# #     print(file_line[0])
# #     print(file_line[1])
# #     print(file_line[2])
# #     print(file_line[3])
# #     file_line = file.readline()
# #     print(file_line)
    
    

In [445]:
# file_name = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/relevance judgements.qrel"
# read_relevance_judgements(file_name)

# Query Processing

Before running any scoring function, you should process the text of the query in exactly the same way that you processed the text of a document. That is:
1. Split the query into tokens (it is most correct to use the regular expression, but for these queries it suffices to split on whitespace)
2. Convert all tokens to lowercase
3. Apply stop-wording to the query using the same list you used in assignment 1
4. Apply the same stemming algorithm to the query which you used in your indexer

In [446]:
def stem_words(tokenized_words_without_stop_words):
    """
    This function takes in list of words which do not contain stop_words.
    It uses the PorterStemmer() to reduce the word to their root words.
    
    Argument:
    removed_all_stop_words -- list of all words which do not have stop_words.
    
    Returns:
    stemmed_words -- list of words which are reduced to their origin word.
    """
    ps = PorterStemmer()
    stemmed_words = list()
    for w in tokenized_words_without_stop_words:
        stemmed_words.append(ps.stem(w))
    stemmed_words.sort()
    return stemmed_words

In [447]:
def query_processing(query_string):
    #splited_query = list(re.split(query))
    splited_query = list(query_string.split())
    #splited_query.lower()
    cleaned_tokens_from_stop_words =  list(set(splited_query) - set(stop_words))
    stemmed_tokens_of_words = stem_words(cleaned_tokens_from_stop_words)
    return stemmed_tokens_of_words

In [448]:
#print(type(qrys[str(202)]))
x = query_processing(qrys[str(202)])
print((x))
print(len(x))

['carl', 'uss', 'vinson']
3


# Scoring Function 1: Okapi BM25

Implement BM25 scores. This should use the following scoring function for document d and query q:
    
Where k1,k2, and b are constants. For start, you can use the values suggested in the lecture on BM25 (k1 = 1.2, k2 varies from 0 to 1000, b = 0.75). Feel free to experiment with different values for these
constants to learn their effect and try to improve performance.

In [449]:
def get_all_documents_length(docid_hashed_file):
    doc_lengths = dict()
    i = 0
    file = open(docid_hashed_file,'r',encoding = "utf-8")
    
    for each_line in file:
        x = each_line.split()
        doc_lengths[x[0]] = x[2]
#        doc_frequency[each_line]
        #print(x[0])
    return doc_lengths 
    

In [450]:
docid_hashed_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/docid_hashed.txt"
all_doc_lengths = get_all_documents_length(docid_hashed_file)
#c = int(all_doc_lengths[str(3058)])
#print(int(all_doc_lengths[str(3058)]))
print((all_doc_lengths))

{'1059': '320', '1533': '390', '6941': '240', '299': '593', '234': '467', '4776': '485', '2658': '855', '6057': '466', '2649': '304', '5256': '701', '3718': '1086', '1885': '202', '4103': '308', '93': '39', '1171': '349', '2410': '233', '7212': '120'}


In [451]:
def get_all_documents_name(docid_hashed_file):
    doc_names = dict()
    i = 0
    file = open(docid_hashed_file,'r',encoding = "utf-8")
    
    for each_line in file:
        x = each_line.split()
        doc_names[x[0]] = x[1]
#        doc_frequency[each_line]
        #print(x[0])
    return doc_names

In [452]:
docid_hashed_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/docid_hashed.txt"
all_doc_names = get_all_documents_name(docid_hashed_file)
print((all_doc_names))

{'1059': 'clueweb12-1202wb-26-10513', '1533': 'clueweb12-1102wb-73-18046', '6941': 'clueweb12-1505wb-68-30103', '299': 'clueweb12-0303wb-53-27200', '234': 'clueweb12-1905wb-44-08158', '4776': 'clueweb12-1012wb-63-19337', '2658': 'clueweb12-1118wb-77-23080', '6057': 'clueweb12-0800tw-39-05237', '2649': 'clueweb12-0211wb-75-04122', '5256': 'clueweb12-0001wb-96-10862', '3718': 'clueweb12-0200wb-25-11228', '1885': 'clueweb12-1905wb-14-19033', '4103': 'clueweb12-1302wb-14-07756', '93': 'clueweb12-1101wb-78-26737', '1171': 'clueweb12-1705wb-99-01272', '2410': 'clueweb12-1504wb-72-14179', '7212': 'clueweb12-1002wb-18-28831'}


In [453]:
def get_all_vocablury(file_name):
    vocablury = dict()
    file = open(file_name,'r',encoding = "utf-8")
    for each_line in file:
        x = each_line.split()
        #Interesting: making term as key and term_id as value (;
        vocablury[x[1]] = x[0]
    return vocablury

In [454]:
voc_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/termid_hashed.txt"
vocablury = get_all_vocablury(voc_file)
#print(vocablury)

In [455]:
#### version 1.0
# def get_document_postings(document_postings_file):
#     doc_freq = dict()
#     file = open(document_postings_file,'r',encoding = "utf-8")
#     #print("to")
#     for each_line in file:
#         x = each_line.split('\t')
#         lineList = [line.rstrip('\n') for line in x]
#         doc_freq[x[0]] = lineList
#     return doc_freq

#### from below here It was next cell in version 1.0

# doc_postings_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/document_postings.txt"
# dict_termid_with_docs_postings = get_document_postings(doc_postings_file)
# #print(dict_termid_with_docs_postings)

# #print(dict_termid_with_docs_postings)
# for kys, values in dict_termid_with_docs_postings.items():
#     #print( dict_termid_with_docs_postings[kys])
#     c = dict_termid_with_docs_postings[kys]
#     z = c[2].strip("\n")
#     x = z.strip("[]")
#     s = x.split(",")
#     print((s))


In [456]:
#### version 2.0
def get_document_postings(document_postings_file):
    termid_with_doc_postings = dict()
    file = open(document_postings_file,'r',encoding = "utf-8")
    for each_line in file:
        x = (re.split("\n",each_line))
        #print("x = " , (x), " len of x = " ,len(x))
        y = (re.split("\t", x[0]))
        #print("y = " , (y), " len of y = " ,len(y))
        
        temp_str = y[(len(y)-1)]
        temp_str = temp_str.strip("[]")
        temp_str = temp_str.replace(" ", "")
        
        lst = list()
        lst = temp_str.split(",")
        
        current_term_id = y[0]
        # create a key in dict_with_docid_and_its_positions
        termid_with_doc_postings[current_term_id] = lst
    return termid_with_doc_postings

In [457]:
doc_postings_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/document_postings.txt"
dict_termid_with_docs_postings = get_document_postings(doc_postings_file)
#print(dict_termid_with_docs_postings)

In [458]:
# version 1.0
# def get_term_frequency(term_index_hashed_file):
#     dict_with_termid_and_its_occurance = dict()
#     file = open(term_index_hashed_file, 'r' , encoding = "utf-8")
#     length = 0
#     current_term_id = 0
#     for each_line in file:
#         #print("each_line = ", (each_line))
#         x = (re.split("\n",each_line))
#         #print("x = " , (x), " len of x = " ,len(x))
#         y = (re.split("\t", x[0]))
        
#         if y[0] == '':
#             y = y[1:]
#         #print(("y = " ,y ))
#         #print("Value at y[len]-1 = " , (y[(len(y)-1)]), " len(y[(len(y)-1)]) = ", len(y[(len(y)-1)]))
        
#         # It means a New term_id aya hai, tou new dict ki key bnani
#         if (len(y) == 4):
#             temp_str = y[(len(y)-1)]
#             temp_str = temp_str.strip("[]")
#             lst = list()
#             lst = temp_str.split(",")
#             #print("temp_str in IF = ", type(temp_str))
#             #print("lst in IF = ", lst, " len(lst) = ", len(lst))
            
#             # reset length variable
#             length = 0
#             # new term_id now found
#             current_term_id = y[0]
#             # create a key in dict_with_termid_and_its_occurance
#             dict_with_termid_and_its_occurance[current_term_id ] = int()
            
#             length = len(lst) #len((y[(len(y)-1)]))
#             dict_with_termid_and_its_occurance[current_term_id] = length
            
#         elif (len(y) == 3):
            
#             temp_str = y[(len(y)-1)]
#             temp_str = temp_str.strip("[]")
#             lst = list()
#             lst = temp_str.split(",")
#             #print("temp_str in ELSE = ", type(temp_str))
#             #print("lst in ELSE = ", lst, " len(lst) = ", len(lst))
            
#             length += len(lst) #len(y[(len(y)-1)])
#             dict_with_termid_and_its_occurance[current_term_id] = length
#         else:
#             print("\nI Don't know what to do.\n")
#         #print('\n')    
#     return dict_with_termid_and_its_occurance   
#         #y[3] = y[3].strip("[]")
        
#         #y = str(y)
#         #y = y.strip("''")
#         #y = y.strip("[]")
#         #print("y = " , (y), " len of y = " ,len(y))
#         #z = y.split(",")
        
#         #if z[0] == "''":
#          #   z = z[1:]
#         #- same term_id
#         #- case 1: agar eik hi doc mein zyada pos hon
#         #- case 2: agar eik doc mein 1 hi pos hon
        
#         #jab naya term_id aye to doc_contaienr wali new dict bnani ha
#         #if (len(z)) == 3:
        
#         # z[0] =  per term_id hoga ya doc_id
#         # z[1] = per hamesha]
            
#         #print("z = " , (z), " len of z = " ,len(z))
#         #term_freq[x[0]] = x[2]
#     #return term_freq

In [483]:
# version 2.0
def get_term_frequency(term_index_hashed_file):
    nested_dict_with_termid_and_its_docs_and_occurance = dict()
    doc_id_with_positions = dict()
    file = open(term_index_hashed_file, 'r' , encoding = "utf-8")
    length = 0
    current_term_id = 0
    for each_line in file:
        #print("each_line = ", (each_line))
        x = (re.split("\n",each_line))
        #print("x = " , (x), " len of x = " ,len(x))
        y = (re.split("\t", x[0]))
        
        if y[0] == '':
            y = y[1:]
        #print(("y = " ,y ))
        #print("Value at y[len]-1 = " , (y[(len(y)-1)]), " len(y[(len(y)-1)]) = ", len(y[(len(y)-1)]))
        
        # It means a New term_id aya hai, tou dict ki new key bnani
        if (len(y) == 4):
            temp_str = y[(len(y)-1)]
            temp_str = temp_str.strip("[]")
            temp_str = temp_str.replace(" ", "")
            lst = list()
            lst = temp_str.split(",")
            #print("temp_str in IF = ", type(temp_str))
            #print("lst in IF = ", lst, " len(lst) = ", len(lst))
            doc_id_with_positions = dict()
            # reset length variable
            length = 0
            # new term_id now found
            current_term_id = y[0]
            # create a key in dict_with_docid_and_its_positions
            doc_id_with_positions[y[1]] = lst
            
            #length = len(lst) #len((y[(len(y)-1)]))
            nested_dict_with_termid_and_its_docs_and_occurance[current_term_id] = doc_id_with_positions
            
        elif (len(y) == 3):
            
            temp_str = y[(len(y)-1)]
            temp_str = temp_str.strip("[]")
            temp_str = temp_str.replace(" ", "")
            lst = list()
            lst = temp_str.split(",")
            #print("temp_str in ELSE = ", type(temp_str))
            #print("lst in ELSE = ", lst, " len(lst) = ", len(lst))
            
            doc_id_with_positions[y[0]] = lst
            #length += len(lst) #len(y[(len(y)-1)])
            nested_dict_with_termid_and_its_docs_and_occurance.update({current_term_id:doc_id_with_positions})
        else:
            print("\nI Don't know what to do.\n")
        #print('\n')    
    return nested_dict_with_termid_and_its_docs_and_occurance

In [484]:
hashed_term_id = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/term_index_hashed.txt"
dict_term_id_with_frequencies = get_term_frequency(hashed_term_id)

print((dict_term_id_with_frequencies[str(583007)]))
#print((dict_term_id_with_frequencies[str(583007)][str(5256)]))
#print(len(dict_term_id_with_frequencies[str(583007)][str(5256)]))
#print(dict_term_id_with_frequencies)

{'1059': ['258', '259'], '1533': ['288'], '6941': ['199'], '299': ['446'], '234': ['360', '361'], '2658': ['706'], '5256': ['553', '554'], '3718': ['904', '905'], '4103': ['245']}


In [461]:
## Version 1.0
# ## Sort-> how ??
# ## Scores > 0 ?? 
# ## Next ??
# def calculate_okapi_bm25():
#     k1 = 1.2 
#     k2 = 300
#     b = 0.75
#     D = 17
#     tf_q_i = 1
    
#     queries_file_name = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/topics.xml"
#     queries_dict = xml_parser(queries_file_name) #returned -> dict
    
#     doc_info_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/docid_hashed.txt"
#     all_doc_lengths = get_all_documents_length(doc_info_file)
    
#     docid_hashed_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/docid_hashed.txt"
#     all_doc_names = get_all_documents_name(docid_hashed_file)
    
#     voc_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/termid_hashed.txt"
#     vocablury = get_all_vocablury(voc_file)
    
#     docid_hashed_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/docid_hashed.txt"
#     all_doc_names = get_all_documents_name(docid_hashed_file)
    
#     doc_postings_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/document_postings.txt"
#     dict_termid_with_docs_postings = get_document_postings(doc_postings_file)
    
#     cleared_dict_termid_with_docs_postings = dict()
#     for kys, values in dict_termid_with_docs_postings.items():
#     #print( dict_termid_with_docs_postings[kys])
#         c = dict_termid_with_docs_postings[kys]
#         z = c[2].strip("\n")
#         x = z.strip("[]")
#         s = x.split(",")
#         cleared_dict_termid_with_docs_postings[kys] = s
    
#     inverted_index_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/term_index_hashed.txt"
#     dict_term_id_with_frequencies = get_term_frequency(inverted_index_file)
    
#     avg_length = 0
#     for doc_id, length in all_doc_lengths.items():
#         avg_length += int(length)
#         #print((avg_length))
#     avg_length /= len(all_doc_lengths) 
       
#     scores_dictionary = dict()    
#     #capital_K = k1 * ((1-b) + (b * (len(d)/avgr_d)))
#     for query_id, query in queries_dict.items(): # run for 10 times
#         splitted_query = query_processing(query)
#         # Ab wo documents aa jaien jin mein ye term id hai.
#         # split kerne k bad, har term ka doc k sath score nikalna hai
#         score_of_each_term_for_single_doc = 0
#         accumulated_score_for_query = 0
#         # query ki eik term, phr dosri, phr teesri and so on.
#         for i in range(0,len(splitted_query)):
#             accumulated_score_for_one_term_in_multiple_docs = 0
#             #print(splitted_query[i])
#             # eik term pakar li
#             # ab dekhna hai wo term kitne docs mein mujood hai
#             #lkn pehle us term ka term_id le leya jae
#             docs_score_for_each_query = dict()
#             if splitted_query[i] in vocablury: # if that term exists
#                 #then get its term_id
#                 term_id = vocablury[splitted_query[i]]
#                 # ab dekhna hai ye term ktne docs mein mujood hai
#                 #print(term_id)
#                 list_of_all_docs_in_which_term_exists = cleared_dict_termid_with_docs_postings[str(term_id)]
#                 #print((list_of_all_docs_in_which_term_exists))
#                 #print(len(list_of_all_docs_in_which_term_exists))
#                 df_i = len(list_of_all_docs_in_which_term_exists)
#                 #tf_q_i = 1
#                 tf_d_i = int(dict_term_id_with_frequencies[str(term_id)])
#                 #doc_score_of_each_term_in_query_in_docs = dict()
#                 # Eik term ki document wise.
#                 for i in range(0,len(list_of_all_docs_in_which_term_exists)):
#                     # wo document utha leya jisme ye term mujood hai.
#                     doc_id = list_of_all_docs_in_which_term_exists[i]
#                     #print(doc_id)
#                     if doc_id in all_doc_lengths: #security check if that doc_id is present in my doc_postings file
#                         length_of_doc_id = int(all_doc_lengths[str(doc_id)])
#                         capital_K = k1 * ((1-b) + (b * (length_of_doc_id/avg_length)))
#                         score_of_each_term_for_single_doc = (float((math.log((D+0.5))/float(df_i+0.5))) * float(((float(1+k1)*tf_d_i)/float(capital_K+tf_d_i))) * float((((1+k2)*float(tf_q_i)/(k2+tf_q_i)))))
#                         #accumulated_score_for_one_term_in_multiple_docs += score_of_each_term_for_single_doc
#                         doc_name = all_doc_names[str(doc_id)]
#                     #docs_score_for_each_query[str(doc_id)] = score_of_each_term_for_single_doc
#                         #check if doc_id key is already present, kia pata ksi term k ne us doc per
#                         # pehle apni value rakhwai ho
#                         if str(doc_id) in docs_score_for_each_query: #pehle se mujood hai
#                             prev_score = docs_score_for_each_query[str(doc_id)]
#                             new_score = prev_score + score_of_each_term_for_single_doc
#                             docs_score_for_each_query[doc_name] = new_score
#                         else:
#                             docs_score_for_each_query[str(doc_id)] = int()
#                             docs_score_for_each_query[str(doc_id)] = score_of_each_term_for_single_doc
#                     else:
#                         docs_score_for_each_query[str(doc_id)] = 0
#                 #accumulated_score_for_query += accumulated_score_for_one_term_in_multiple_docs
#             else: 
#                 docs_score_for_each_query[str(doc_id)] = 0
#                 #print(docs_score_for_each_query)
                    
#             scores_dictionary[query_id] = dict()
#             scores_dictionary[query_id] = docs_score_for_each_query
#     return scores_dictionary

In [462]:
# ### version 2.0
# ## Sort-> how ??
# ## Scores > 0 ?? 
# ## Next ??
# def calculate_okapi_bm25():
#     k1 = 1.2 
#     k2 = 300
#     b = 0.75
#     D = 17
#     tf_q_i = 1
    
#     queries_file_name = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/topics.xml"
#     queries_dict = xml_parser(queries_file_name) #returned -> dict
    
#     doc_info_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/docid_hashed.txt"
#     all_doc_lengths = get_all_documents_length(doc_info_file)
    
#     docid_hashed_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/docid_hashed.txt"
#     all_doc_names = get_all_documents_name(docid_hashed_file)
    
#     voc_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/termid_hashed.txt"
#     vocablury = get_all_vocablury(voc_file)
    
#     docid_hashed_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/docid_hashed.txt"
#     all_doc_names = get_all_documents_name(docid_hashed_file)
    
#     doc_postings_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/document_postings.txt"
#     dict_termid_with_docs_postings = get_document_postings(doc_postings_file)
    
#     cleared_dict_termid_with_docs_postings = dict()
#     for kys, values in dict_termid_with_docs_postings.items():
#     #print( dict_termid_with_docs_postings[kys])
#         c = dict_termid_with_docs_postings[kys]
#         z = c[2].strip("\n")
#         x = z.strip("[]")
#         s = x.split(",")
#         cleared_dict_termid_with_docs_postings[kys] = s
    
#     inverted_index_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/term_index_hashed.txt"
#     dict_term_id_with_docs_and_positions = get_term_frequency(inverted_index_file)
    
#     avg_length = 0
#     for doc_id, length in all_doc_lengths.items():
#         avg_length += int(length)
#         #print((avg_length))
#     avg_length /= len(all_doc_lengths) 
       
#     scores_dictionary = dict()    
#     #capital_K = k1 * ((1-b) + (b * (len(d)/avgr_d)))
#     for query_id, query in queries_dict.items(): # run for 10 times
#         splitted_query = query_processing(query)
#         # Ab wo documents aa jaien jin mein ye term id hai.
#         # split kerne k bad, har term ka doc k sath score nikalna hai
#         score_of_each_term_for_single_doc = 0
#         accumulated_score_for_query = 0
#         # query ki eik term, phr dosri, phr teesri and so on.
#         for i in range(0,len(splitted_query)):
#             accumulated_score_for_one_term_in_multiple_docs = 0
#             #print(splitted_query[i])
#             # eik term pakar li
#             # ab dekhna hai wo term kitne docs mein mujood hai
#             #lkn pehle us term ka term_id le leya jae
#             docs_score_for_each_query = dict()
#             if splitted_query[i] in vocablury: # if that term exists
#                 #then get its term_id
#                 term_id = vocablury[splitted_query[i]]
#                 # ab dekhna hai ye term ktne docs mein mujood hai
#                 #print(term_id)
#                 list_of_all_docs_in_which_term_exists = cleared_dict_termid_with_docs_postings[str(term_id)]
#                 #print((list_of_all_docs_in_which_term_exists))
#                 #print(len(list_of_all_docs_in_which_term_exists))
#                 df_i = len(list_of_all_docs_in_which_term_exists)
#                 #tf_q_i = 1
                
#                 #doc_score_of_each_term_in_query_in_docs = dict()
#                 # Eik term ki document wise.
#                 for i in range(0,len(list_of_all_docs_in_which_term_exists)):
#                     # wo document utha leya jisme ye term mujood hai.
#                     doc_id = list_of_all_docs_in_which_term_exists[i]
#                     #print(doc_id)
#                     if doc_id in all_doc_lengths: #security check if that doc_id is present in my doc_postings file
#                         tf_d_i = len(dict_term_id_with_docs_and_positions[str(term_id)][str(doc_id)])
#                         #print("tf_d_i", tf_d_i)
#                         length_of_doc_id = int(all_doc_lengths[str(doc_id)])
#                         capital_K = k1 * ((1-b) + (b * (length_of_doc_id/avg_length)))
#                         score_of_each_term_for_single_doc = (float((math.log((D+0.5))/float(df_i+0.5))) * float(((float(1+k1)*tf_d_i)/float(capital_K+tf_d_i))) * float((((1+k2)*float(tf_q_i)/(k2+tf_q_i)))))
#                         #accumulated_score_for_one_term_in_multiple_docs += score_of_each_term_for_single_doc
#                         doc_name = all_doc_names[str(doc_id)]
#                     #docs_score_for_each_query[str(doc_id)] = score_of_each_term_for_single_doc
#                         #check if doc_id key is already present, kia pata ksi term k ne us doc per
#                         # pehle apni value rakhwai ho
#                         if str(doc_id) in docs_score_for_each_query: #pehle se mujood hai
#                             prev_score = docs_score_for_each_query[str(doc_id)]
#                             new_score = prev_score + score_of_each_term_for_single_doc
#                             docs_score_for_each_query[doc_name] = new_score
#                         else:
#                             docs_score_for_each_query[str(doc_id)] = int()
#                             docs_score_for_each_query[str(doc_id)] = score_of_each_term_for_single_doc
#                     else:
#                         docs_score_for_each_query[str(doc_id)] = 0
#                 #accumulated_score_for_query += accumulated_score_for_one_term_in_multiple_docs
#             else: 
#                 docs_score_for_each_query[str(doc_id)] = 0
#                 #print(docs_score_for_each_query)
                    
#             scores_dictionary[query_id] = dict()
#             scores_dictionary[query_id] = docs_score_for_each_query
#     return scores_dictionary

In [463]:
# ### build version 2.1 : removed unnecessary comments & print statements, and added proper comments
# def calculate_okapi_bm25():
#     k1 = 1.2 
#     k2 = 300
#     b = 0.75
#     D = 17
    
#     queries_file_name = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/topics.xml"
#     queries_dict = xml_parser(queries_file_name) 
    
#     doc_info_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/docid_hashed.txt"
#     all_doc_lengths = get_all_documents_length(doc_info_file)
    
#     docid_hashed_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/docid_hashed.txt"
#     all_doc_names = get_all_documents_name(docid_hashed_file)
    
#     voc_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/termid_hashed.txt"
#     vocablury = get_all_vocablury(voc_file)
    
#     docid_hashed_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/docid_hashed.txt"
#     all_doc_names = get_all_documents_name(docid_hashed_file)
    
#     doc_postings_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/document_postings.txt"
#     dict_termid_with_docs_postings = get_document_postings(doc_postings_file)
    
#     cleared_dict_termid_with_docs_postings = dict()
#     for kys, values in dict_termid_with_docs_postings.items():
#         c = dict_termid_with_docs_postings[kys]
#         z = c[2].strip("\n")
#         x = z.strip("[]")
#         s = x.split(",")
#         cleared_dict_termid_with_docs_postings[kys] = s
    
#     inverted_index_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/term_index_hashed.txt"
#     dict_term_id_with_docs_and_positions = get_term_frequency(inverted_index_file)
    
#     avg_length = 0
#     for doc_id, length in all_doc_lengths.items():
#         avg_length += int(length)
#     avg_length /= len(all_doc_lengths) 
       
#     scores_dictionary = dict()
    
#     # Will run for number of times of queries in topics.xml
#     for query_id, query in queries_dict.items(): # run for 10 times
#         # Split one query in terms
#         splitted_query = query_processing(query)
#         score_of_each_term_for_single_doc = 0
        
#         # For loop for each term in single queries,
#         # i.e. if there is 3 word query, it will run for 3 times.
#         for i in range(0,len(splitted_query)):            
#             # Reset : docs_score_for_each_query.
#             docs_score_for_each_query = dict()
            
#             if(splitted_query[i]  == 'carl'):
#                 print("yes 1")
            
#             # Check If this splitted term (from query) exists in my vocablury.
#             if splitted_query[i] in vocablury: # if that term exists
#                 # If YES, then get term_id of this splitted term (from query).
#                 # Now, get value = term_id by passing term as key.
#                 term_id = vocablury[splitted_query[i]]
                
#                 if(splitted_query[i]  == 'carl'):
#                     print("yes 2 term_id = ", term_id)
                
#                 # Get list of documents in which this term exists
#                 list_of_all_docs_in_which_term_exists = cleared_dict_termid_with_docs_postings[(term_id)]
                
#                 if(splitted_query[i]  == 'carl'):
#                     print("yes 3 list_of_all_docs_in_which_term_exists = ", list_of_all_docs_in_which_term_exists)
                
#                 # Get it's document_frequency, i.e. In how many docs it is present.
#                 df_i = len(list_of_all_docs_in_which_term_exists)
                
#                 if(splitted_query[i]  == 'carl'):
#                     print("yes 4 df = ", df_i)
                
#                 #doc_score_of_each_term_in_query_in_docs = dict()
                
#                 # Now, run this loop for all docs in which it is present.
#                 # i.e if it is present in 3 docs, it will run for 3 
#                 for j in range(0,len(list_of_all_docs_in_which_term_exists)):
#                     # Now, pick one by one doc_id, and compute score.
#                     doc_id = list_of_all_docs_in_which_term_exists[j]
                    
#                     # Check IF that doc_id is present in my doc_postings file
#                     if doc_id in all_doc_lengths:
#                         # Get this term's frquency in this document
#                         tf_q_i = 1
#                         tf_d_i = len(dict_term_id_with_docs_and_positions[str(term_id)][str(doc_id)])
#                         length_of_doc_id = int(all_doc_lengths[(doc_id)])
#                         capital_K = k1 * ((1-b) + (b * (length_of_doc_id/avg_length)))
                        
#                         if(splitted_query[i]  == 'carl'):
#                             print("yes 5 K = ", capital_K)

#                         a = float(D + 0.5)
#                         b = float(df_i + 0.5)
#                         c = float(math.log(a/b)) ###
#                         d = float((1+k1) * tf_d_i) ###
#                         e = float(capital_K + tf_d_i) ###
#                         f = float((1+k2) * tf_q_i) ###
#                         g = float(k2+tf_q_i) ###
                        
#                         score_of_each_term_for_single_doc = c * (d/e) * (f/g)
#                         doc_name = all_doc_names[(doc_id)]
                        
#                         if(splitted_query[i]  == 'carl'):
#                             print("yes 6 score_of_each_term_for_single_doc = ", score_of_each_term_for_single_doc)
                        
#                         # Check IF already a term might have calculated score for this document (for same query).
#                         # Or we can say, multiple term words of single query might present in same document.
#                         # If YES: Else NO
#                         if str(doc_id) in docs_score_for_each_query:
#                             #print("han kch para tha")
#                             prev_score = docs_score_for_each_query[(doc_id)]
#                             new_score = prev_score + score_of_each_term_for_single_doc
#                             docs_score_for_each_query[doc_id] = new_score
                            
#                             if(splitted_query[i]  == 'carl'):
#                                 print("yes 7 IF = ", new_score)
                                
#                             scores_dictionary[query_id] = docs_score_for_each_query

#                         # Or maybe we found new term in new document
#                         else:
#                             #print("sab naya hai")
#                             if(splitted_query[i]  == 'carl'):
#                                 print("yes 7 ELSE = ", score_of_each_term_for_single_doc)
                            
#                             docs_score_for_each_query[(doc_id)] = int()
#                             docs_score_for_each_query[(doc_id)] = score_of_each_term_for_single_doc
                            
#                             scores_dictionary[query_id] = dict()
#                             scores_dictionary[query_id] = docs_score_for_each_query
#                     # Or maybe there is a document which is not in my possession.
#                     else:
#                         if(splitted_query[i]  == 'carl'):
#                                 print("yes 8 = ", 0)
#                         docs_score_for_each_query[(doc_id)] = 0
#                         scores_dictionary[query_id] = docs_score_for_each_query
#             # If this term is not in my Vocablury
#             else: 
#                 #if(splitted_query[i]  == 'carl'):
                
#                 print("Terms of Queries which are not in my collection = ", splitted_query[i])
                
#                 #docs_score_for_each_query[(doc_id)] = 0
#                 #print(docs_score_for_each_query)

#     return scores_dictionary

In [464]:
# ### build version 2.2 : final removed unnecessary comments & print statements, and added proper comments
# def calculate_okapi_bm25(parameters):
#     k1 = parameters['k1']
#     k2 = parameters['k2']
#     b = parameters['b']
#     D = parameters['D']
    
#     queries_file_name = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/topics.xml"
#     queries_dict = xml_parser(queries_file_name) 
    
#     doc_info_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/docid_hashed.txt"
#     all_doc_lengths = get_all_documents_length(doc_info_file)
    
#     docid_hashed_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/docid_hashed.txt"
#     all_doc_names = get_all_documents_name(docid_hashed_file)
    
#     voc_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/termid_hashed.txt"
#     vocablury = get_all_vocablury(voc_file)
    
#     doc_postings_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/document_postings.txt"
#     dict_termid_with_docs_postings = get_document_postings(doc_postings_file)
    
#     cleared_dict_termid_with_docs_postings = dict()
#     for kys, values in dict_termid_with_docs_postings.items():
#         c = dict_termid_with_docs_postings[kys]
#         z = c[2].strip("\n")
#         x = z.strip("[]")
#         s = x.split(",")
#         cleared_dict_termid_with_docs_postings[kys] = s
    
#     inverted_index_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/term_index_hashed.txt"
#     dict_term_id_with_docs_and_positions = get_term_frequency(inverted_index_file)
    
#     avg_length = 0
#     for doc_id, length in all_doc_lengths.items():
#         avg_length += int(length)
#     avg_length /= len(all_doc_lengths) 
       
#     scores_dictionary = dict()
    
#     # Will run for number of times of queries in topics.xml
#     for query_id, query in queries_dict.items(): # run for 10 times
#         # Split one query in terms
#         splitted_query = query_processing(query)
#         score_of_each_term_for_single_doc = 0
#         # Reset : docs_score_for_each_query.
#         docs_score_for_each_query = dict()
            
#         # For loop for each term in single queries,
#         # i.e. if there is 3 word query, it will run for 3 times.
#         for i in range(0,len(splitted_query)):            
#             # Check If this splitted term (from query) exists in my vocablury.
#             if splitted_query[i] in vocablury: # if that term exists
#                 # If YES, then get term_id of this splitted term (from query).
#                 # Now, get value = term_id by passing term as key.
#                 term_id = vocablury[splitted_query[i]]
                
#                 # Get list of documents in which this term exists
#                 list_of_all_docs_in_which_term_exists = cleared_dict_termid_with_docs_postings[(term_id)]
                
#                 # Get it's document_frequency, i.e. In how many docs it is present.
#                 df_i = len(list_of_all_docs_in_which_term_exists)
            
#                 #doc_score_of_each_term_in_query_in_docs = dict()
                
#                 # Now, run this loop for all docs in which it is present.
#                 # i.e if it is present in 3 docs, it will run for 3 
#                 for j in range(0,len(list_of_all_docs_in_which_term_exists)):
#                     # Now, pick one by one doc_id, and compute score.
#                     doc_id = list_of_all_docs_in_which_term_exists[j]
                    
#                     # Check IF that doc_id is present in my doc_postings file
#                     if doc_id in all_doc_lengths:
#                         # Get this term's frquency in this document
#                         tf_q_i = 1
#                         tf_d_i = len(dict_term_id_with_docs_and_positions[str(term_id)][(doc_id)])
#                         length_of_doc_id = int(all_doc_lengths[(doc_id)])
#                         capital_K = k1 * ((1-b) + (b * (length_of_doc_id/avg_length)))

#                         a = float(D + 0.5)
#                         b = float(df_i + 0.5)
#                         c = float(math.log(a/b)) ###
#                         d = float((1+k1) * tf_d_i) ###
#                         e = float(capital_K + tf_d_i) ###
#                         f = float((1+k2) * tf_q_i) ###
#                         g = float(k2+tf_q_i) ###
                        
#                         score_of_each_term_for_single_doc = c * (d/e) * (f/g)
#                         doc_name = all_doc_names[(doc_id)]
                        
#                         # Check IF already a term might have calculated score for this document (for same query).
#                         # Or we can say, multiple term words of single query might present in same document.
#                         # If YES: Else NO
#                         if doc_id in docs_score_for_each_query:
#                             prev_score = docs_score_for_each_query[(doc_id)]
#                             new_score = prev_score + score_of_each_term_for_single_doc
#                             docs_score_for_each_query[doc_id] = new_score
                            
#                             scores_dictionary[query_id] = docs_score_for_each_query

#                         # Or maybe we found new term in new document
#                         else:
#                             docs_score_for_each_query[(doc_id)] = int()
#                             docs_score_for_each_query[(doc_id)] = score_of_each_term_for_single_doc
                            
#                             scores_dictionary[query_id] = dict()
#                             scores_dictionary[query_id] = docs_score_for_each_query
#                     # Or maybe there is a document which is not in my possession.
#                     else:
#                         docs_score_for_each_query[(doc_id)] = 0
#                         scores_dictionary[query_id] = docs_score_for_each_query
#             # If this term is not in my Vocablury
#             else: 
#                 print("Terms of Queries which are not in my collection = ", splitted_query[i])

#     return scores_dictionary

In [480]:
### build version 2.3 :
#Changes Made: 
# final removed unnecessary comments & print statements, and added proper comments & replaced
# doc_id with doc_names as key of docs_score_for_each_query()
def calculate_okapi_bm25(parameters):
    k1 = parameters['k1']
    k2 = parameters['k2']
    b = parameters['b']
    D = parameters['D']
    
    queries_file_name = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/topics.xml"
    queries_dict = xml_parser(queries_file_name) 
    
    doc_info_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/docid_hashed.txt"
    all_doc_lengths = get_all_documents_length(doc_info_file)
    
    docid_hashed_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/docid_hashed.txt"
    all_doc_names = get_all_documents_name(docid_hashed_file)
    
    voc_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/termid_hashed.txt"
    vocablury = get_all_vocablury(voc_file)
    
    doc_postings_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/document_postings.txt"
    dict_termid_with_docs_postings = get_document_postings(doc_postings_file)
    
    inverted_index_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/term_index_hashed.txt"
    dict_term_id_with_docs_and_positions = get_term_frequency(inverted_index_file)
    
    avg_length = 0
    for doc_id, length in all_doc_lengths.items():
        avg_length += int(length)
    avg_length /= len(all_doc_lengths) 
       
    scores_dictionary = dict()
    # Will run for number of times of queries in topics.xml
    for query_id, query in queries_dict.items(): # run for 10 times
        
        # Split one query in terms
        splitted_query = query_processing(query)
        score_of_each_term_for_single_doc = 0
        # Reset : docs_score_for_each_query.
        docs_score_for_each_query = dict()
        #scores_dictionary[query_id] = dict()    
        # For loop for each term in single queries,
        # i.e. if there is 3 word query, it will run for 3 times.
        for i in range(0,len(splitted_query)):            
            # Check If this splitted term (from query) exists in my vocablury.
            if splitted_query[i] in vocablury: # if that term exists
                # If YES, then get term_id of this splitted term (from query).
                # Now, get value = term_id by passing term as key.
                term_id = vocablury[splitted_query[i]]
                
                # Get list of documents in which this term exists
                list_of_all_docs_in_which_term_exists = dict_termid_with_docs_postings[(term_id)]
                
                # Get it's document_frequency, i.e. In how many docs it is present.
                df_i = len(list_of_all_docs_in_which_term_exists)
            
                #doc_score_of_each_term_in_query_in_docs = dict()
                
                # Now, run this loop for all docs in which it is present.
                # i.e if it is present in 3 docs, it will run for 3 
                for j in range(0,len(list_of_all_docs_in_which_term_exists)):
                    # Now, pick one by one doc_id, and compute score.
                    doc_id = list_of_all_docs_in_which_term_exists[j]
                    doc_name = all_doc_names[doc_id]
                    # Check IF that doc_id is present in my doc_postings file
                    if doc_id in all_doc_lengths:
                        # Get this term's frquency in this document
                        tf_d_i = len(dict_term_id_with_docs_and_positions[str(term_id)][(doc_id)])
                        tf_q_i = 1
                        length_of_doc_id = int(all_doc_lengths[(doc_id)])
                        capital_K = k1 * ((1-b) + (b * (length_of_doc_id/avg_length)))
                        a = float(D + 0.5)
                        b = float(df_i + 0.5)
                        c = float(math.log(a/b)) ###
                        d = float((1+k1) * tf_d_i) ###
                        e = float(capital_K + tf_d_i) ###
                        f = float((1+k2) * tf_q_i) ###
                        g = float(k2+tf_q_i) ###
                        score_of_each_term_for_single_doc = c * (d/e) * (f/g)
                        
                        # Check IF already a term might have calculated score for this document (for same query).
                        # Or we can say, multiple term words of single query might present in same document.
                        # If YES: Else NO
                        if doc_name in docs_score_for_each_query:
                            prev_score = docs_score_for_each_query[doc_name]
                            new_score = prev_score + score_of_each_term_for_single_doc
                            docs_score_for_each_query.update({doc_name : new_score})
                            sorted_docs_score_for_each_query = sorted(docs_score_for_each_query.items(), key=operator.itemgetter(1), reverse=True)
                            scores_dictionary[query_id] = sorted_docs_score_for_each_query
                        # Or maybe we found new document. Let's create a new key of doc_id on same query_id
                        else:
                            docs_score_for_each_query[doc_name] = int()
                            docs_score_for_each_query[doc_name] = score_of_each_term_for_single_doc
                            sorted_docs_score_for_each_query = sorted(docs_score_for_each_query.items(), key=operator.itemgetter(1),reverse=True)
                            scores_dictionary[query_id] = sorted_docs_score_for_each_query
                            #scores_dictionary[query_id] = dict()
                            scores_dictionary[query_id] = sorted_docs_score_for_each_query
                    # Or maybe there is a document which is not in my possession.
                    else:
                        docs_score_for_each_query[doc_name] = 0
                        scores_dictionary[query_id] = docs_score_for_each_query
            # If this term is not in my Vocablury
            else: 
                print("Terms of Queries which are not in my collection = ", splitted_query[i])

    return scores_dictionary

In [481]:
parameters = dict()
parameters['k1'] = 1.2 
parameters['k2'] = 300
parameters['b'] = 0.75
parameters['D'] = 17


calculate_okapi_bm25(parameters)

Terms of Queries which are not in my collection =  uss
Terms of Queries which are not in my collection =  vinson
Terms of Queries which are not in my collection =  gain
Terms of Queries which are not in my collection =  2008
Terms of Queries which are not in my collection =  dog
Terms of Queries which are not in my collection =  world'
Terms of Queries which are not in my collection =  war


{'202': [('clueweb12-1012wb-63-19337', 2.31304098769458)],
 '214': [('clueweb12-1905wb-14-19033', 3.3881091795214116),
  ('clueweb12-0211wb-75-04122', 3.1340425731532746),
  ('clueweb12-0800tw-39-05237', 2.1445981205836477),
  ('clueweb12-0200wb-25-11228', 1.2341386021338656)],
 '216': [('clueweb12-1302wb-14-07756', 5.95190162460523),
  ('clueweb12-1705wb-99-01272', 4.7808621413327375),
  ('clueweb12-0800tw-39-05237', 4.1547868035711435),
  ('clueweb12-0001wb-96-10862', 2.2510214596424722),
  ('clueweb12-1905wb-44-08158', 0.8720184255815691),
  ('clueweb12-0200wb-25-11228', 0.2882922342511921)],
 '221': [('clueweb12-1905wb-14-19033', 472.55739438597215),
  ('clueweb12-0211wb-75-04122', 13.833527580491179),
  ('clueweb12-0303wb-53-27200', 3.756757749538382),
  ('clueweb12-1905wb-44-08158', 0.8720184255815691),
  ('clueweb12-0200wb-25-11228', 0.2882922342511921)],
 '227': [('clueweb12-1012wb-63-19337', 3.1937314883666224),
  ('clueweb12-1905wb-44-08158', 3.0258731499680662),
  ('clueweb1

In [499]:
## version v3.0
def dirichlet_smoothing():
    
    queries_file_name = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/topics.xml"
    queries_dict = xml_parser(queries_file_name) 
    
    doc_info_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/docid_hashed.txt"
    all_doc_lengths = get_all_documents_length(doc_info_file)
    
    docid_hashed_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/docid_hashed.txt"
    all_doc_names = get_all_documents_name(docid_hashed_file)
    
    voc_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/termid_hashed.txt"
    vocablury = get_all_vocablury(voc_file)
    
    doc_postings_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/document_postings.txt"
    dict_termid_with_docs_postings = get_document_postings(doc_postings_file)
    
    inverted_index_file = "/Users/imbilalbutt/Documents/Semesters/Semester 9/Information Retrieval/Assignment/hw2/input/term_index_hashed.txt"
    dict_term_id_with_docs_and_positions = get_term_frequency(inverted_index_file)
    
    mu = 0
    total_length_of_collection = 0
    for doc_id, length in all_doc_lengths.items():
        total_length_of_collection += int(length)
    mu = total_length_of_collection/len(all_doc_lengths) 
    
    scores_dictionary = dict()
    # Will run for number of times of queries in topics.xml
    for query_id, query in queries_dict.items(): # run for 10 times
        # Split one query in terms
        splitted_query = query_processing(query)
        score_of_each_term_for_single_doc = 0
        # Reset : docs_score_for_each_query.
        docs_score_for_each_query = dict()
        
        # For loop for each term in single queries,
        # i.e. if there is 3 word query, it will run for 3 times.
        for i in range(0,len(splitted_query)):            
            # Check If this splitted term (from query) exists in my vocablury.
            if splitted_query[i] in vocablury: # if that term exists
                # If YES, then get term_id of this splitted term (from query).
                # Now, get value = term_id by passing term as key.
                term_id = vocablury[splitted_query[i]]
                
                # Get list of documents in which this term exists.
                list_of_all_docs_in_which_term_exists = dict_termid_with_docs_postings[(term_id)]
                
                # Count the number of times each word occurs in Corpora, divide by total length.
                # Basically add-up all lengths of documents.
                sum_of_term_in_whole_corpora = 0
                for d_idx, positinos in dict_term_id_with_docs_and_positions[term_id].items():
                    sum_of_term_in_whole_corpora += len(positinos)
                prob_of_term_occuring_in_whole_corpora = sum_of_term_in_whole_corpora/total_length_of_collection
                
                # Now, run this loop for all docs in which it is present.
                # i.e if it is present in 3 docs, it will run for 3.
                for j in range(0,len(list_of_all_docs_in_which_term_exists)):
                    # Now, pick one by one doc_id, and compute score.
                    doc_id = list_of_all_docs_in_which_term_exists[j]
                    doc_name = all_doc_names[doc_id]
                    # Check IF that doc_id is present in my doc_postings file.
                    if doc_id in all_doc_lengths:
                        N = int(all_doc_lengths[(doc_id)]) # doc length
                        lamdba = N/(N+mu)
                        one_minus_lamdba = 1 - lamdba
                        
                        # Count the number of times word occurs in document, divide by document length.
                        count_of_term_in_single_doc = len(dict_term_id_with_docs_and_positions[term_id][doc_id])
                        prob_occuring_in_single_doc = count_of_term_in_single_doc / N
                        
                        score_of_each_term_for_single_doc = (lamdba * prob_occuring_in_single_doc) + (one_minus_lamdba * prob_of_term_occuring_in_whole_corpora)
                                                
                        # Check IF already a term might have calculated score for this document (for same query).
                        # Or we can say, multiple term words of single query might present in same document.
                        # If YES: Else NO
                        if doc_name in docs_score_for_each_query:
                            prev_score = docs_score_for_each_query[doc_name]
                            new_score = prev_score + score_of_each_term_for_single_doc
                            docs_score_for_each_query.update({doc_name : new_score})
                            sorted_docs_score_for_each_query = sorted(docs_score_for_each_query.items(), key=operator.itemgetter(1), reverse=True)
                            scores_dictionary[query_id] = sorted_docs_score_for_each_query
                        # Or maybe we found new document. Let's create a new key of doc_id on same query_id.
                        else:
                            docs_score_for_each_query[doc_name] = int()
                            docs_score_for_each_query[doc_name] = score_of_each_term_for_single_doc
                            sorted_docs_score_for_each_query = sorted(docs_score_for_each_query.items(), key=operator.itemgetter(1),reverse=True)
                            scores_dictionary[query_id] = sorted_docs_score_for_each_query
                            #scores_dictionary[query_id] = dict()
                            scores_dictionary[query_id] = sorted_docs_score_for_each_query
                    # Or maybe there is a document which is not in my possession.
                    else:
                        docs_score_for_each_query[doc_name] = 0
                        scores_dictionary[query_id] = docs_score_for_each_query
            # If this term is not in my Vocablury.
            else: 
                print("Terms of Queries which are not in my collection = ", splitted_query[i])

    return scores_dictionary

In [500]:
dirichlet_smoothing()

Terms of Queries which are not in my collection =  uss
Terms of Queries which are not in my collection =  vinson
Terms of Queries which are not in my collection =  gain
Terms of Queries which are not in my collection =  2008
Terms of Queries which are not in my collection =  dog
Terms of Queries which are not in my collection =  world'
Terms of Queries which are not in my collection =  war


{'202': [('clueweb12-1012wb-63-19337', 0.00116860351879504)],
 '214': [('clueweb12-1905wb-14-19033', 0.00179380664652568),
  ('clueweb12-0211wb-75-04122', 0.001541457082589648),
  ('clueweb12-0200wb-25-11228', 0.001483216237314598),
  ('clueweb12-0800tw-39-05237', 0.0011936339522546418)],
 '216': [('clueweb12-0800tw-39-05237', 0.005371352785145889),
  ('clueweb12-0001wb-96-10862', 0.004246395806028834),
  ('clueweb12-1705wb-99-01272', 0.003131922694981285),
  ('clueweb12-1302wb-14-07756', 0.0018557366467645638),
  ('clueweb12-1905wb-44-08158', 0.0015234814863880242),
  ('clueweb12-0200wb-25-11228', 0.000897736143637783)],
 '221': [('clueweb12-0303wb-53-27200', 0.006728928592145716),
  ('clueweb12-1905wb-14-19033', 0.006136706948640484),
  ('clueweb12-0211wb-75-04122', 0.005273405808859322),
  ('clueweb12-1905wb-44-08158', 0.0014572431608928928),
  ('clueweb12-0200wb-25-11228', 0.000858704137392662)],
 '227': [('clueweb12-1012wb-63-19337', 0.0037005778095176266),
  ('clueweb12-1905wb-44

# Evaluation


To evaluate your results, we will write a program that computes mean average precision of the 
rank list of documents for different queries. The input to program will be the <font color=blue> qrel file 
(relevance judgments) </font> and scoring file that has rank list of documents. 

The output should be following measures: 
    
<font color=red>  </font> <font color=green> P@5  </font>

<font color=red>  </font> <font color=green> P@10 </font>

<font color=red>  </font> <font color=green> P@20 </font>

<font color=red>  </font> <font color=green> P@30 </font>

<font color=red>  </font> <font color=green> MAP </font>

These measures should be computed for each query. Average for all queries should also be computed.
