# TFIDF, BM25, UnigramLM

## 1.Collection and Vocabulary

Import *collection_vocabulary.py* and create an instance of the collection class. Its attributes are the collection itself, the vocabulary, and some descriptive summary statistics (e.g. vocabulary size, collection size, collection length).

**Table of Contents**
1. Collection and Vocabulary
2. Document Term Matrix
3. Inverted Index
4. TFIDF
5. Unigram LM with J-M-Smoothing
6. BM25
7. Sample Queries

We will fix and use the parameters of the different models as discussed in the lecture. 

In [1]:
from collection_vocabulary import Collection
col=Collection()

## 2. Document Term Matrix
The document term matrix is obtained as a lists of lists, and then converted to a Pandas dataframe, which is stored to disk to facilitate debugging, and further experimentation.

In [2]:
doc_term_matrix=[]
for doc in col.collection:
    tf_vector =[]
    for word in col.vocabulary:
        n= col.collection[doc].count(word)
        tf_vector.append(n)
    doc_term_matrix.append(tf_vector)

In [3]:
import pandas as pd
import numpy as np
doc_term_matrix= pd.DataFrame(data=doc_term_matrix,index= col.collection.keys(),columns=col.vocabulary)
doc_term_matrix.to_pickle('doc_term_matrix.pkl')

In [4]:
doc_term_matrix.head(3) # this is how the doc term matrix looks like

Unnamed: 0,'hort,+,-,--a,--all,--have,--mainly,--of,--showed,--the,...,zooplankton,zoxazolamine,zr,zu,zuccarini,zucchini,zugesetztem,zusatzstoffe-online,zygote,zymography
MED-10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
MED-14,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
MED-118,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
# Sanity Check: should have dimensions 3633*29522
doc_term_matrix.shape

(3633, 29052)

In [6]:
# some summary stats for our project report and a sanity check that would reveal any empty docs
doc_term_matrix.sum(axis=1).describe()

count    3633.000000
mean      146.204789
std        53.995711
min        11.000000
25%       114.000000
50%       148.000000
75%       174.000000
max       939.000000
dtype: float64

## 3. Inverted Index

The inverted index is our unified (and in practice memory-efficient) way of representing the document term matrix that we will use in the remainder of this project.



This is a good point to start off the actual analysis and calculate different retrieval models.

In [7]:
inverted_index= doc_term_matrix.transpose()
inverted_index.to_pickle('inverted_index.pkl') # use later for embeddings, queries, ... 

In [8]:
# sanity check 1
# each term should occur at least once (implied by the way we construct the index), hence min>=1
inverted_index.sum(axis=1).min()

1

## 4. TF-IDF

### IDF

In [9]:
df=(inverted_index>0).sum(axis=1)

In [10]:
raw_idf=(col.collection_size/df)

In [11]:
raw_idf.tail()

zucchini               3633.0
zugesetztem            3633.0
zusatzstoffe-online    3633.0
zygote                 3633.0
zymography             1816.5
dtype: float64

In [12]:
idf= np.log10(raw_idf) #aka log of raw_idf
idf.tail()

zucchini               3.560265
zugesetztem            3.560265
zusatzstoffe-online    3.560265
zygote                 3.560265
zymography             3.259235
dtype: float64

In [13]:
# Sanity check: max tf score should be equal to number of docs in collection...
raw_idf.max().max()==3633

True

In [14]:
# Sanity check: ... and max idf score should be substantially lower
idf.max().max()

3.5602653978627146

### TF
Raw term frequency is what we obtain when we look columnwise at the  *inverted_index* dataframe.
As discussed in the lecture, we will normalize this frequency by dividing with the raw frequency of the most frequent term in each document. Next, we then take the logarithm (any logarithm will do the job) since we assume that relevance does not increase linearly with term frequency.

In [15]:
# nominator part
nominator= inverted_index.applymap(lambda x: np.log10(x, out=np.zeros_like(inverted_index.as_matrix),where=x!=0))
nominator= nominator+1

In [16]:
nominator.shape

(29052, 3633)

In [17]:
# denominator part
most_frequent_term=inverted_index.max(axis=0) # determine most frequent term in each doc
denominator= np.log10(most_frequent_term)
denominator+=1

In [18]:
denominator.shape

(3633,)

In [19]:
#sanity check, there shouldn't be any zeros
denominator.min()

1.0

In [20]:
#sanity check, there shouldn't be any zeros
nominator.min().min()

1.0

In [21]:
#tf
tf=nominator.div(denominator, axis=1)

### TFIDF
Bringing the pieces together.

In [22]:
tfidf= tf.mul(idf, axis=0) # we multiply the tf scores in every doc with the corresponding idf scores
tfidf.to_pickle('tfidf.pkl')

In [23]:
tfidf.shape

(29052, 3633)

# 5. Unigram LM with Jelinek-Mercer-Smoothing


### Global Language Model
We want to find out how likely each word is if we look at the whole corpus. 

In [24]:
global_LM=inverted_index.sum(axis=1)/col.collection_length # equal: inverted_index.sum(axis=1)/inverted_index.sum(axis=1).sum()

### Local Language Models
We want to obtain a language model for each document in the collection, therefore we look at the columns of the *inverted_index* dataframe. 

In [25]:
local_LMs=inverted_index/inverted_index.sum()

In [26]:
# Sanity Check: Probabilities should sum columnwise to 1, and adding all columns should yield the collection size (3633)
local_LMs.sum().sum()

3633.0

### Unigram LM with J-M-Smoothing
As introduced in the lecture, this smoothing scheme assigns equal weights to the global and local LMs.

In [27]:
unigram_LM= (local_LMs.apply(lambda x: x+ global_LM)).apply(lambda x: x/2)# same as multiplying both by 0.5 and adding them
unigram_LM.to_pickle('unigramLM.pkl')#writing Unigram-LM to disk

In [28]:
# sanity check: probabilities in every doc should sum up to one and all docs should sum up tp 3633
unigram_LM.sum().sum() 

3633.0000000012674

In [29]:
#sanity check: we don't want to have any negative values 
unigram_LM.min().min()<0

False

In [30]:
#sanity check: we don't want to have any zeros (since we are smoothing)
unigram_LM.isnull().sum(axis=1).sum()==0 # we check whether therer are no zeros > True intended

True

In [31]:
'''
omitted: operating in log-space to avoid numerical instability
global_LM_log_space=np.log(global_LM)
local_LMs_log_space=local_LMs.applymap(lambda x: np.log(x, out=np.zeros_like(inverted_index.as_matrix),where=x!=0))

 '''

'\nomitted: operating in log-space to avoid numerical instability\nglobal_LM_log_space=np.log(global_LM)\nlocal_LMs_log_space=local_LMs.applymap(lambda x: np.log(x, out=np.zeros_like(inverted_index.as_matrix),where=x!=0))\n\n '

# 6. BIM 25
Let's approach BIM25 step by step, which means modeling the BIM and then gradually extending it. 
We start from the naive assumption that we do not have any relevance feedbacks.


### BIM 
This simplification results in the following formula we want to compute:
w_t= log(0.5 * N/N_t)

N_t signifies in how many documents a term appears. This is what we already calucalted as the 'raw' document frequency in the TFIDF-model above. 
What we are basically doing is multiplying the raw inverse document frequency by 0.5 and then taking the logarithm.

Note: This can (and is intended to) produce negative values for words occuring in almost every document.

In [32]:
BIM= np.log10(raw_idf*0.5) # raw idf calculated above in tfidf
BIM.head()

'hort    3.259235
+        2.782114
-        3.259235
--a      3.259235
--all    3.259235
dtype: float64

In [33]:
# observation: in BIM 25 weights may actually become negative - we have four negative weights
sum(BIM<0)

4

### BM 25
Let's focus on the weighting part and then multiply these weights with the BIM weights from above.

In [34]:
# parameters as presented in the lecture
k=1.5
b=0.25
document_lenght= inverted_index.sum()
average_document_length= col.collection_length/col.collection_size # 146.20478943022295 TODO: include in project report
doc_len_div_by_avg_doc_len= document_lenght/average_document_length
#sanity check, should yield 3633
doc_len_div_by_avg_doc_len.sum() == 3633

False

In [35]:
weighting_bim25_nominator= inverted_index.apply(lambda x: x*(k+1))
weighting_bim25_nominator.shape

(29052, 3633)

In [36]:
#the denominator is the tricky part since we have to add scalars and a vector to each column in the inverted index at the same time
weighting_bim25_denominator=inverted_index.add((doc_len_div_by_avg_doc_len*k*b), axis=1)+(k*(1-b))
weighting_bim25_denominator.shape

(29052, 3633)

In [37]:
#merging nominator and denominator
weighting_bim25= weighting_bim25_nominator.div(weighting_bim25_denominator)

In [38]:
#sanity check: 29052, 3633 ?
weighting_bim25.shape

(29052, 3633)

Combining the weights, and the vanilla BIM from above, we can now construct BIM25.

In [39]:
BIM25=weighting_bim25.mul(BIM, axis=0)
BIM25.to_pickle('BIM25.pkl')

# 7. Querying

@Philipp Naeser: please look at *query.py* which already specifies in a class how a query should look like.
Actually it may make sense to reinvert the inverted index, so you can pick the tfidf value columnwise.

### Sample single term queries

Let's look at the same single term query  - "cancer". And compare the results of the three retrieval models.

In [40]:
#TFIDF
a= tfidf.loc['cancer'].sort_values(ascending=False).head(10) # if you transpose you can directly select by the index term  > tf.transpose().cancer
a

MED-3718    0.695754
MED-2322    0.695754
MED-5063    0.695754
MED-2137    0.695754
MED-4096    0.695754
MED-4643    0.695754
MED-4117    0.695754
MED-1414    0.695754
MED-3551    0.695754
MED-3550    0.695754
Name: cancer, dtype: float64

In [41]:
# Unigram LM
b= unigram_LM.loc['cancer'].sort_values(ascending=False).head(10)
b

MED-3703    0.081339
MED-2137    0.061650
MED-2174    0.057772
MED-4391    0.052210
MED-890     0.048745
MED-5184    0.048281
MED-3551    0.047909
MED-3555    0.047347
MED-2258    0.045462
MED-3699    0.044962
Name: cancer, dtype: float64

In [42]:
c= BIM25.loc['cancer'].sort_values(ascending=False).head(10)
c

MED-3703    0.917499
MED-1721    0.895649
MED-2760    0.894057
MED-3699    0.892689
MED-3555    0.884759
MED-14      0.881640
MED-4928    0.880848
MED-5353    0.877033
MED-4050    0.876367
MED-4785    0.876234
Name: cancer, dtype: float64

Obviously, there is very little overlap in the top 10 retrieved documents. Only the top-ranked doc of the probabilisitic ranking models matches.

Now, let's get the query representations and compute the scores for each document

In [43]:
#To speed things up, load the pickle files. If you just ran the whole script, you do not need this step
#Note: pkl files are excluded from git for being to large, so you have to run the whole script once
tfidf = pd.read_pickle('tfidf.pkl')
BIM25 = pd.read_pickle('BIM25.pkl')
unigram_LM = pd.read_pickle('unigramLM.pkl')

In [44]:
#Now to get the queries
train_queries = pd.read_csv('nfcorpus/train.all.queries', sep='\t', header=None)
train_queries.columns = ['id', 'text']
dev_queries = pd.read_csv('nfcorpus/dev.all.queries', sep='\t', header=None)
dev_queries.columns = ['id', 'text']
test_queries = pd.read_csv('nfcorpus/test.all.queries', sep='\t', header=None)
test_queries.columns = ['id', 'text']

#And the relevance scores given
train_rel = pd.read_csv('nfcorpus/train.3-2-1.qrel', sep='\t', header=None)
print(train_rel.describe())
test_rel = pd.read_csv('nfcorpus/test.3-2-1.qrel', sep='\t', header=None)
print(test_rel.describe())
dev_rel = pd.read_csv('nfcorpus/dev.3-2-1.qrel', sep='\t', header=None)
#As we can see, column 1 is always 0, so drop it
train_rel = train_rel.drop([1], axis=1)
dev_rel = dev_rel.drop([1], axis=1)
test_rel = test_rel.drop([1], axis=1)
train_rel.columns = ['qid', 'docid', 'rel']
dev_rel.columns = ['qid', 'docid', 'rel']
test_rel.columns = ['qid', 'docid', 'rel']

              1              3
count  139350.0  139350.000000
mean        0.0       1.824212
std         0.0       0.454204
min         0.0       1.000000
25%         0.0       2.000000
50%         0.0       2.000000
75%         0.0       2.000000
max         0.0       3.000000
             1             3
count  15820.0  15820.000000
mean       0.0      1.816056
std        0.0      0.472168
min        0.0      1.000000
25%        0.0      2.000000
50%        0.0      2.000000
75%        0.0      2.000000
max        0.0      3.000000


In [45]:
#Just some experimenting, delete this later
for doc in test_rel.loc[(test_rel['qid'] == 'PLAIN-2')].docid:
    print(doc)
rel_temp = test_rel.loc[(test_rel['qid'] == 'PLAIN-2')]
for row in rel_temp.itertuples():
    print(row.rel)

MED-2427
MED-10
MED-2429
MED-2430
MED-2431
MED-14
MED-2432
MED-5322
MED-5323
MED-5324
MED-5325
MED-5326
MED-5327
MED-5328
MED-5329
MED-5330
MED-5331
MED-5332
MED-5333
MED-5334
MED-5335
MED-5363
MED-5337
MED-5338
MED-5339
MED-5340
MED-5341
MED-5342
MED-2428
MED-2440
MED-2434
MED-2435
MED-2436
MED-2437
MED-2438
MED-2439
MED-3597
MED-3598
MED-3599
MED-4556
MED-4559
MED-4560
MED-4828
MED-4829
MED-4830
3
3
3
3
3
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2


In [46]:
#this function assumes that tfidf, bim25 and unigam_LM are set
def compute_scores(queries, rel):
    #Compute the scores, this went to be less lines of code than initially (and expected)
    doc_keys = tfidf.columns
    list_of_df = []
    query_keys = queries['id']
    for key in query_keys:
        text = str(queries.loc[queries['id'] == key].text)
        tfidf_scores = tfidf.loc[text.split()].sum()
        bim25_scores = BIM25.loc[text.split()].sum()
        unigram_scores = unigram_LM.loc[text.split()].product()
        total = pd.DataFrame()
        total['tfidf'] = tfidf_scores
        total['bim25'] = bim25_scores
        total['unigram'] = unigram_scores
        total['qid'] = key.replace('PLAIN-', '')
        #Rel only contains 1 and 2, everything that is not in there is set to 0
        total['rel'] = 0
        rel_temp = rel.loc[(rel['qid'] == key)]
        for row in rel_temp.itertuples():
            total.at[row.docid, 'rel'] = row.rel
        total.set_index(np.arange(len(doc_keys)))
        total.rename(columns={'': 'docid'}, inplace=True)
        list_of_df.append(total)

    scores = pd.concat(list_of_df)
    #Sanity check: should be same
    print(len(scores))
    print(len(doc_keys)*len(query_keys))
    return scores

In [47]:
train_scores = compute_scores(train_queries, train_rel)
dev_scores = compute_scores(dev_queries, dev_rel)
test_scores = compute_scores(test_queries, test_rel)

9424002
9424002
1180725
1180725
1180725
1180725


In [48]:
train_scores

Unnamed: 0,tfidf,bim25,unigram,qid,rel
MED-10,5.023640,0.000000,2.314569e-23,10,0
MED-14,4.971875,0.000000,2.314569e-23,10,0
MED-118,5.140227,0.000000,2.314569e-23,10,0
MED-301,6.371605,0.000000,2.314569e-23,10,0
MED-306,5.140227,0.000000,2.314569e-23,10,0
MED-329,6.668556,0.931327,1.404000e-22,10,0
MED-330,5.549975,0.000000,2.314569e-23,10,0
MED-332,5.797478,0.000000,2.314569e-23,10,0
MED-334,5.206435,0.000000,2.314569e-23,10,0
MED-335,5.664839,0.000000,2.314569e-23,10,0


In [49]:
#Here you can see that rel is not always 0
test_scores.describe()

Unnamed: 0,tfidf,bim25,unigram,rel
count,1180725.0,1180725.0,1180725.0,1180725.0
mean,7.588534,0.1618845,2.462048e-08,0.02433251
std,2.286501,0.5308435,7.655339e-06,0.2158334
min,0.9384311,0.0,1.4853279999999998e-44,0.0
25%,6.025946,0.0,5.475276e-35,0.0
50%,7.581093,0.0,2.436599e-31,0.0
75%,9.087797,0.0,1.8834000000000001e-25,0.0
max,23.33306,19.84373,0.006253765,3.0


In [50]:
train_scores.to_pickle('train_scores.pkl')
dev_scores.to_pickle('dev_scores.pkl')
test_scores.to_pickle('test_scores.pkl')

For reduced task: create files according to rank lib documentation: https://sourceforge.net/p/lemur/wiki/RankLib%20File%20Format/

In [51]:
#Create csv for Ranklib, code taken from answer here: https://stackoverflow.com/questions/37439533/pandas-custom-file-format
feature_columns = ['tfidf','bim25','unigram']
cols2id = {col:str(i+1) for i,col in enumerate(feature_columns)}

def f(x):
    if x.name in feature_columns:
        return cols2id[x.name] + ':' + x.astype(str)
    elif x.name == 'qid':
        return 'qid:' + x.astype(str)
    else:
        return x

(train_scores.apply(lambda x: f(x))[['rel','qid'] + feature_columns]
  .to_csv('reduced_task/train.csv', sep=' ', index=False, header=None)
)
(dev_scores.apply(lambda x: f(x))[['rel','qid'] + feature_columns]
  .to_csv('reduced_task/dev.csv', sep=' ', index=False, header=None)
)
(test_scores.apply(lambda x: f(x))[['rel','qid'] + feature_columns]
  .to_csv('reduced_task/test.csv', sep=' ', index=False, header=None)
)

For Full task: use the generated scores to train and evaluate point and pairwise models