# BIM 25
Let's approach BIM25 step by step: we first model the BIM and then gradually extend it.
We start from the naive assumption that we do not have any relevance feedback.

In [2]:
import pandas as pd
import numpy as np
import pickle
# Add the directory containing collection_vocabulary.py to the path; otherwise the col.pkl object cannot be unpickled
import sys
sys.path.append('../0_Collection_and_Inverted_Index/')
with open('../0_Collection_and_Inverted_Index/pickle/col.pkl', 'rb') as f:
    col = pickle.load(f)
inverted_index = pd.read_pickle('../0_Collection_and_Inverted_Index/pickle/inverted_index.pkl')

### BIM 
This simplification results in the following formula, which we want to compute:
w_t = log(0.5 * N / N_t)

N is the number of documents in the collection, and N_t is the number of documents in which term t appears. We already calculated N_t as the 'raw' document frequency in the TFIDF model above.
In effect, we multiply the raw inverse document frequency N/N_t by 0.5 and then take the logarithm (base 10 in the code below).

Note: This can (and is intended to) produce negative values for words occurring in more than half of all documents (N_t > N/2).
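
To make the sign change concrete, here is a minimal sketch with made-up numbers. The collection size `N` and the document frequencies `N_t` are hypothetical and not taken from our collection. The weight is positive for rare terms, exactly zero at N_t = N/2, and negative beyond that.

```python
import numpy as np

N = 1000                                  # hypothetical collection size
N_t = np.array([1, 100, 500, 501, 999])   # hypothetical document frequencies

# same formula as above, with the base-10 logarithm
w_t = np.log10(0.5 * N / N_t)

for n_t, w in zip(N_t, w_t):
    print(f"N_t={n_t:4d}  w_t={w:+.4f}")
```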

In [3]:
df=(inverted_index>0).sum(axis=1)
raw_idf=(col.collection_size/df)
BIM= np.log10(raw_idf*0.5)
BIM.head()

'hort    3.259235
+        2.782114
-        3.259235
--a      3.259235
--all    3.259235
dtype: float64

In [4]:
# observation: BIM weights can actually become negative; here we have four negative weights
(BIM<0).sum()

4

### BM 25
Let's first compute the term-frequency weighting part and then multiply these weights by the BIM weights from above.
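
As a reference point for the matrix operations in the following cells, here is a minimal scalar sketch of the weighting for a single (term, document) pair. The function name `tf_weight` and the example values are hypothetical; the formula mirrors the numerator and denominator computed in this notebook, with the lecture parameters as defaults. Note that this numerator carries an extra constant factor k compared with the common textbook form tf*(k+1); a constant factor rescales all scores but does not change the ranking.

```python
def tf_weight(tf, doc_len, avg_doc_len, k=1.5, b=0.25):
    """Hypothetical helper: weighting for one (term, document) pair."""
    numerator = tf * k * (k + 1)
    # the length-dependent part grows with doc_len / avg_doc_len
    denominator = tf + k * b * (doc_len / avg_doc_len) + k * (1 - b)
    return numerator / denominator

# the weight grows with tf but saturates below k * (k + 1)
for tf in (0, 1, 2, 10, 1000):
    print(tf, tf_weight(tf, doc_len=100, avg_doc_len=100))

# the same tf counts for less in a longer-than-average document
print(tf_weight(1, doc_len=200, avg_doc_len=100))
```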

In [5]:
# parameters as presented in the lecture
k=1.5
b=0.25
document_length= inverted_index.sum()
average_document_length= col.collection_length/col.collection_size # 146.20478943022295 TODO: include in project report
doc_len_div_by_avg_doc_len= document_length/average_document_length
#sanity check, should yield 3633
doc_len_div_by_avg_doc_len.sum()

3633.000000000004
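
Why the sanity check should yield the number of documents can be seen from a short derivation. It assumes that `col.collection_length` is the total number of tokens, i.e. the sum of all document lengths, and that `col.collection_size` is the number of documents N. The symbols dl_d (length of document d) and avdl (average document length) are introduced here only for illustration.

```latex
\sum_{d=1}^{N} \frac{dl_d}{avdl}
  = \frac{\sum_{d=1}^{N} dl_d}{\frac{1}{N}\sum_{d=1}^{N} dl_d}
  = N
```

The small deviation from 3633 in the output is floating-point rounding.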

In [7]:
weighting_bim25_nominator= inverted_index*k*(k+1)
weighting_bim25_nominator.shape

(29052, 3633)

In [8]:
# the denominator is the tricky part: each column (document) gets its own length-dependent term, and a scalar is added to every cell
weighting_bim25_denominator=inverted_index.add((doc_len_div_by_avg_doc_len*k*b), axis=1)+(k*(1-b))
weighting_bim25_denominator.shape

(29052, 3633)
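
The alignment used in the denominator can be checked on a toy example. The term and document labels and all values below are made up. With `axis=1`, pandas matches the index of the Series against the column labels of the DataFrame, so every document column receives its own offset, while the scalar is added to every cell.

```python
import pandas as pd

# hypothetical term-by-document counts
tf = pd.DataFrame(
    [[1, 0, 2], [0, 3, 1]],
    index=["term_a", "term_b"],
    columns=["d1", "d2", "d3"],
)
# hypothetical per-document offsets, indexed by the column labels
per_doc = pd.Series([0.1, 0.2, 0.3], index=["d1", "d2", "d3"])

# per-column offset via add(..., axis=1), then a scalar for every cell
shifted = tf.add(per_doc, axis=1) + 10
print(shifted)
```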

In [9]:
# dividing the numerator by the denominator
weighting_bim25= weighting_bim25_nominator.div(weighting_bim25_denominator)
# sanity check: shape should be (29052, 3633)
weighting_bim25.shape

(29052, 3633)

Combining these weights with the vanilla BIM from above, we can now construct BIM25.

In [10]:
BIM25=weighting_bim25.mul(BIM, axis=0)
BIM25.to_pickle('pickle/BIM25.pkl')

The paper (http://www.cl.uni-heidelberg.de/~riezler/publications/papers/ECIR2016.pdf) uses different values for k and b, so we compute this variant as well.

In [6]:
# parameters as presented in paper
k=1.2
b=0.75
document_length= inverted_index.sum()
average_document_length= col.collection_length/col.collection_size # 146.20478943022295 TODO: include in project report
doc_len_div_by_avg_doc_len= document_length/average_document_length
#sanity check, should yield 3633
doc_len_div_by_avg_doc_len.sum()

3633.000000000004

In [12]:
weighting_bim25_nominator= inverted_index*k*(k+1)
weighting_bim25_nominator.shape

(29052, 3633)

In [13]:
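# recompute the denominator with the new k and b, then divide the numerator by it
weighting_bim25_denominator=inverted_index.add((doc_len_div_by_avg_doc_len*k*b), axis=1)+(k*(1-b))
weighting_bim25= weighting_bim25_nominator.div(weighting_bim25_denominator)
# sanity check: shape should be (29052, 3633)
weighting_bim25.shape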

(29052, 3633)

In [14]:
BIM25_alt=weighting_bim25.mul(BIM, axis=0)
BIM25_alt.to_pickle('pickle/BIM25_alt.pkl')
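
Both parameter settings run the same steps, so one way to guarantee that numerator and denominator always use the same k and b is to wrap the computation in a function. The following is a minimal sketch, not the code used to produce the pickles above. The name `bim25_weights` and the toy matrix are hypothetical, and the sketch assumes that the number of documents equals the number of columns and that the average document length equals the mean of the column sums.

```python
import numpy as np
import pandas as pd

def bim25_weights(inverted_index, k, b):
    """Hypothetical helper: term-by-document BIM25 weights for a tf matrix."""
    n_docs = inverted_index.shape[1]              # assumption: one column per document
    df = (inverted_index > 0).sum(axis=1)         # document frequency per term
    bim = np.log10(0.5 * n_docs / df)             # BIM weight per term
    doc_len = inverted_index.sum()                # tokens per document
    rel_len = doc_len / doc_len.mean()            # assumption: average length = mean column sum
    numerator = inverted_index * k * (k + 1)
    denominator = inverted_index.add(rel_len * k * b, axis=1) + k * (1 - b)
    return numerator.div(denominator).mul(bim, axis=0)

# toy term-by-document counts (made up)
toy = pd.DataFrame(
    [[2, 0, 1, 0], [1, 1, 1, 0], [0, 0, 0, 3]],
    index=["apple", "the", "zebra"],
    columns=["d1", "d2", "d3", "d4"],
)

lecture = bim25_weights(toy, k=1.5, b=0.25)
paper = bim25_weights(toy, k=1.2, b=0.75)
print(lecture)
print(paper)
```

In the toy matrix, "the" occurs in three of four documents and therefore gets negative weights, while "apple" occurs in exactly half of the documents and gets weight zero.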