# Putting it all together

## 1: Creating the collection and vocabulary

Import *collection_vocabulary.py* and create an instance of the collection class. Its attributes are the collection itself, the vocabulary, and some descriptive summary statistics (e.g. vocabulary size, collection size, collection length).

In [1]:
from collection_vocabulary import Collection
col=Collection()

In [2]:
doc_term_matrix=[]
for doc in col.collection:
    tf_vector =[]
    for word in col.vocabulary:
        n= col.collection[doc].count(word)
        tf_vector.append(n)
    doc_term_matrix.append(tf_vector)

## 2: Creating the document term matrix
The document term matrix is obtained as a list of lists from the collection created in step 1. It is then converted to a Pandas dataframe, which is stored to disk to facilitate debugging and further experimentation.

In [5]:
import pandas as pd
import numpy as np
doc_term_matrix= pd.DataFrame(data=doc_term_matrix,index= col.collection.keys(),columns=col.vocabulary)
doc_term_matrix.to_pickle('doc_term_matrix.pkl')
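
The nested loop that builds the list of lists calls `list.count` once per document and vocabulary term, which can get slow for a vocabulary of this size. A minimal sketch of an equivalent construction on a toy collection, assuming (as the loop suggests) that `col.collection` maps document ids to token lists. The toy data and the names `toy_collection`, `toy_vocabulary` and `toy_dtm` are made up for illustration:

```python
from collections import Counter

import pandas as pd

# Hypothetical toy data standing in for col.collection and col.vocabulary.
toy_collection = {
    "D1": ["cancer", "risk", "cancer"],
    "D2": ["diet", "risk"],
}
toy_vocabulary = sorted({tok for toks in toy_collection.values() for tok in toks})

# Count the tokens of every document once, then look the counts up per term.
rows = []
for tokens in toy_collection.values():
    counts = Counter(tokens)  # missing terms yield 0
    rows.append([counts[term] for term in toy_vocabulary])

toy_dtm = pd.DataFrame(rows, index=list(toy_collection.keys()), columns=toy_vocabulary)
print(toy_dtm)
```

Each document is scanned once instead of once per vocabulary term; the resulting matrix has the same layout (documents as rows, terms as columns).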

In [38]:
doc_term_matrix.head(3) # this is what the doc term matrix looks like

Unnamed: 0,'hort,+,-,--a,--all,--have,--mainly,--of,--showed,--the,...,zooplankton,zoxazolamine,zr,zu,zuccarini,zucchini,zugesetztem,zusatzstoffe-online,zygote,zymography
MED-10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
MED-14,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
MED-118,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [39]:
# Sanity check: ensure that each doc contains at least one term, i.e. there are no empty docs
doc_term_matrix.sum(axis=1).min()

11

In [42]:
# some more summary stats for our project report
doc_term_matrix.sum(axis=1).max()

939

In [41]:
doc_term_matrix.sum(axis=1).median()

148.0

This is a good point to start off the actual analysis and calculate different retrieval models.

## 3. TF-IDF

### IDF

Let's start with the easier part: calculating inverse document frequencies and storing them in a data frame. The index of all dataframes produced in the remainder of this notebook will be the vocabulary ('the inverted index'), which allows applying the different retrieval models and querying the collection in a standardized way.

We will fix and use the parameters of the different models as discussed in the lecture. This may be subject to further hyperparameter tuning (bear in mind we are 'learning to rank').

In [43]:
inverted_index= doc_term_matrix.transpose()
inverted_index.to_pickle('inverted_index.pkl') # let's also store the inverted index to disk
def greater_than_zero(some_value): return some_value > 0
def calculate_idf(some_value): return np.log10(col.collection_size/some_value)
df=greater_than_zero(inverted_index).sum(axis=1) # document frequency: number of docs in which each term occurs
idf=df.apply(calculate_idf)
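
To make the IDF step concrete, here is a small sketch on a toy inverted index. It assumes that the document frequency of a term is the number of documents whose count for that term is greater than zero, and it uses the base-10 logarithm as in `calculate_idf`. The names `toy_index`, `toy_df` and `toy_idf` are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy inverted index: rows are terms, columns are documents.
toy_index = pd.DataFrame(
    {"D1": [2, 1, 0], "D2": [0, 1, 1], "D3": [0, 1, 0], "D4": [1, 1, 0]},
    index=["cancer", "risk", "diet"],
)
n_docs = toy_index.shape[1]

# Document frequency: in how many documents does each term occur?
toy_df = (toy_index > 0).sum(axis=1)

# Inverse document frequency with a base-10 logarithm.
toy_idf = np.log10(n_docs / toy_df)
print(toy_idf)
```

A term that occurs in every document ("risk" here) gets an IDF of 0, so it cannot contribute to a TF-IDF score.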

In [44]:
# sanity check 1
# each term should occur at least once (implied by the way we construct the index), hence min>=1
inverted_index.sum(axis=1).min()

1

### TF

Raw term frequency is what we already have when we look column-wise at the *inverted_index* dataframe.
As discussed in the lecture, we normalize this frequency by the raw frequency of the most frequent term in each document. Both counts are log-scaled (any base will do), since we assume that relevance does not increase linearly with term frequency.

In [48]:
#TODO: double-check this with slides 14 and 19 of lecture 4
most_frequent_term=inverted_index.max(axis=0) # determine most frequent term in each doc
tf= (1+ np.log10(inverted_index)).div(np.log10(1+ most_frequent_term),axis=1) # log10(0) = -inf for absent terms (triggers a RuntimeWarning)
tf.replace([np.inf, -np.inf], 0,inplace=True) # absent terms get a tf weight of 0
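
The same weighting can be computed without ever evaluating `log10(0)`. The following is only a sketch on toy data, assuming the formula from the cell above, (1 + log10(count)) / log10(1 + max count), with weight 0 for absent terms. The names `toy_index`, `positive_counts` and `toy_tf` are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy inverted index: rows are terms, columns are documents.
toy_index = pd.DataFrame(
    {"D1": [9, 1, 0], "D2": [0, 1, 1]},
    index=["cancer", "risk", "diet"],
)

max_count = toy_index.max(axis=0)  # count of the most frequent term per document

# Mask zero counts as NaN before taking the log, so log10(0) is never evaluated.
positive_counts = toy_index.where(toy_index > 0)
toy_tf = (1 + np.log10(positive_counts)).div(np.log10(1 + max_count), axis=1)
toy_tf = toy_tf.fillna(0.0)  # absent terms get weight 0
print(toy_tf)
```

Masking first avoids the infinities (and the accompanying warning), so no `replace` step is needed afterwards.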


### TFIDF
Bringing the pieces together.

In [229]:
tfidf= tf.mul(idf, axis=0) # we multiply the tf scores in every doc with the corresponding idf scores
tfidf.to_pickle('tfidf.pkl')

In [118]:
tfidf.describe()

Unnamed: 0,MED-10,MED-14,MED-118,MED-301,MED-306,MED-329,MED-330,MED-332,MED-334,MED-335,...,MED-938,MED-939,MED-940,MED-892,MED-906,MED-917,MED-941,MED-942,MED-952,MED-961
count,29052.0,29052.0,29052.0,29052.0,29052.0,29052.0,29052.0,29052.0,29052.0,29052.0,...,29052.0,29052.0,29052.0,29052.0,29052.0,29052.0,29052.0,29052.0,29052.0,29052.0
mean,0.004152,0.003803,0.002935,0.00635,0.00571,0.008275,0.004693,0.00638,0.004965,0.006263,...,0.005649,0.006764,0.006286,0.004697,0.005315,0.004602,0.00586,0.004637,0.004089,0.003638
std,0.082391,0.077375,0.069604,0.136185,0.105349,0.157469,0.097023,0.112068,0.100331,0.123847,...,0.116824,0.120984,0.111506,0.092023,0.123315,0.11075,0.118862,0.119924,0.102378,0.072841
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,3.600756,3.083144,3.279604,6.749227,3.851621,5.952586,4.40711,4.553026,5.226977,5.921194,...,4.655919,6.681074,3.977361,5.34778,7.100402,4.240363,5.921194,6.090362,6.658925,2.740518


In [119]:
tfidf.shape

(29052, 3633)

In [120]:
#TODO (not part of the project): Think about storing this index as a real inverted index which may be used for look-up (e.g. Python dictionary)
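
One possible shape for such a look-up structure is a dictionary that maps each term to its postings, i.e. only the documents that actually contain it. This is a sketch on toy data, not part of the project code; `toy_collection` and `postings` are hypothetical names:

```python
from collections import defaultdict

# Hypothetical toy collection: document id -> list of tokens.
toy_collection = {
    "D1": ["cancer", "risk", "cancer"],
    "D2": ["diet", "risk"],
}

# term -> {doc_id: raw term frequency}; documents without the term are not stored.
postings = defaultdict(dict)
for doc_id, tokens in toy_collection.items():
    for token in tokens:
        postings[token][doc_id] = postings[token].get(doc_id, 0) + 1

print(dict(postings))
```

Compared with the dense dataframe, which is almost entirely zeros, this stores one entry per (term, document) pair that actually occurs.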

## 3. Unigram LM with Jelinek-Mercer Smoothing
Let's start where we always start off: generating/importing the inverted index and generating an instance of the collection class which already provides us with certain summary statistics.

### Global Language Model
We want to find out how likely each word is if we look at the whole corpus. 

In [122]:
global_LM=inverted_index.sum(axis=1)/col.collection_length # equivalent to: inverted_index.sum(axis=1)/inverted_index.sum(axis=1).sum()

### Local Language Models
We want to obtain a language model for each document in the collection, therefore we look at the columns of the *inverted_index* dataframe. 

In [129]:
local_LM=inverted_index/inverted_index.sum() # divide each column (document) by its length

In [135]:
# Sanity check: probabilities should sum column-wise to 1, and adding all columns should yield the collection size (3633)
local_LM.sum().sum()

3632.9999999999995

### Unigram LM with J-M-Smoothing
Jelinek-Mercer smoothing linearly interpolates the local (document) and global (collection) LMs. As introduced in the lecture, we assign equal weights to both, i.e. an interpolation weight of 0.5.

In [132]:
unigram_LM= (local_LM.apply(lambda x: x+ global_LM)).apply(lambda x: x/2)# same as multiplying both by 0.5
unigram_LM.to_pickle('unigramLM.pkl')#writing Unigram-LM to disk
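
If the interpolation weight is to be tuned later, the same computation can be written with an explicit parameter. A minimal sketch on toy data; the function `jelinek_mercer` and its parameter `lam` (the weight of the local model) are hypothetical names, and `lam=0.5` reproduces the equal weighting used above:

```python
import pandas as pd

# Hypothetical toy inverted index: rows are terms, columns are documents.
toy_index = pd.DataFrame(
    {"D1": [2, 1, 0], "D2": [0, 1, 1]},
    index=["cancer", "risk", "diet"],
)

def jelinek_mercer(index, lam=0.5):
    """Interpolate per-document and collection-wide unigram LMs; lam weights the local model."""
    local = index / index.sum()                         # every column sums to 1
    global_lm = index.sum(axis=1) / index.values.sum()  # sums to 1 over the vocabulary
    return local.mul(lam).add(global_lm.mul(1 - lam), axis=0)

smoothed = jelinek_mercer(toy_index, lam=0.5)
print(smoothed)
```

Because both input models are probability distributions, every column of the result still sums to 1 for any `lam` between 0 and 1, which is the same sanity check applied to `unigram_LM`.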

In [133]:
# sanity check: probabilities in every doc should sum up to one and all docs should sum up to 3633
unigram_LM.sum().sum() 

3633.0000000010787

## 4. BM25
Let's approach Okapi BM25 (`BIM25` in the code) step by step, which means modeling the binary independence model (BIM) first and then gradually extending it.
We start from the naive assumption that we do not have any relevance feedback.


### BIM 
This simplification results in the following formula we want to compute:
w_t= log(0.5 * N/N_t)

N_t signifies the number of documents in which a term appears. This is what we already calculated as the 'raw' document frequency (`df`) in the TF-IDF model above.
What we are basically doing is multiplying the raw inverse document frequency N/N_t by 0.5 and then taking the logarithm.

In [54]:
BIM=df.apply(lambda x: col.collection_size/x).apply(lambda x: x*0.5).apply(lambda x: np.log10(x))
#TODO: Ask Robert/Goran about base of logarithm
#In Okapi BM25, no multiplication with 0.5 according to this site: https://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html

In [55]:
# TODO: Discuss this: Interesting observation: in BM25 weights may actually become negative
sum(BIM<0) # number of terms with a negative weight; since we are adding term weights, this seems okay

4
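
The negative weights follow directly from the formula: log(0.5 * N/N_t) drops below zero as soon as N_t > N/2, i.e. for terms that occur in more than half of all documents. A small worked sketch, using the collection size of 3633 reported above; the helper `bim_weight` is a hypothetical name:

```python
import math

N = 3633  # collection size reported above

def bim_weight(n_t, n_docs=N):
    """w_t = log10(0.5 * N / N_t), the BIM weight without relevance feedback."""
    return math.log10(0.5 * n_docs / n_t)

print(bim_weight(1))     # rare term: large positive weight
print(bim_weight(1816))  # just under half the collection: still positive
print(bim_weight(1817))  # more than half the collection: negative
```

So the four terms counted above are those that appear in at least 1817 of the 3633 documents.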

### BM25
Let's focus on the term-frequency weighting part first and then multiply these weights with the BIM weights from above.

In [71]:
# parameters as presented in the lecture
k=1.5
b=0.25
document_length= inverted_index.sum() # number of tokens per document (column sums)
average_document_length= col.collection_length/col.collection_size # 146.20478943022295 TODO: include in project report
doc_len_div_by_avg_doc_len= document_length.apply(lambda x: x/average_document_length)
#sanity check, should yield 3633
doc_len_div_by_avg_doc_len.sum() == 3633


True

In [137]:
weighting_bim25_nominator= inverted_index.apply(lambda x: x*(k+1))

In [183]:
#the denominator is the tricky part since we have to add scalars and a vector to each column in the inverted index at the same time
weighting_bim25_denominator=inverted_index.add((doc_len_div_by_avg_doc_len*k*b), axis=1)+(k*(1-b))

In [182]:
#merging numerator and denominator
weighting_bim25= weighting_bim25_nominator.div(weighting_bim25_denominator)

In [185]:
#sanity check: 29052, 3633 ?
weighting_bim25.shape

(29052, 3633)

Combining the weights, and the vanilla BIM from above, we can now construct BIM25.

In [192]:
BIM25=weighting_bim25.mul(BIM, axis=0)
BIM25.to_pickle('BIM25.pkl')
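
To see what the term-frequency factor does for a single (term, document) pair, it helps to write it as a scalar function. This is a sketch of the same expression as in the cells above, with the lecture parameters as defaults; the function name `bm25_tf_weight` is hypothetical:

```python
def bm25_tf_weight(tf, doc_len, avg_doc_len, k=1.5, b=0.25):
    """Saturating, length-normalized term-frequency factor:
    tf * (k + 1) / (tf + k * (1 - b) + k * b * doc_len / avg_doc_len)."""
    return tf * (k + 1) / (tf + k * (1 - b) + k * b * doc_len / avg_doc_len)

avg_len = 146.2  # rounded average document length from the cell above
for tf in (0, 1, 2, 5, 50):
    print(tf, round(bm25_tf_weight(tf, doc_len=avg_len, avg_doc_len=avg_len), 3))
```

For a document of average length the denominator reduces to tf + k, so a single occurrence yields a factor of exactly 1, and the factor approaches k + 1 = 2.5 as tf grows. Longer-than-average documents get a smaller factor for the same tf.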

## 5. Querying

@Philipp Naeser: please look at *query.py*, which already specifies in a class what a query should look like.
It may make sense to transpose the inverted index again, so that you can pick the tfidf values column-wise.
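
For queries with more than one term, one simple option is to sum the per-term weights of each document. The sketch below is independent of *query.py* and only illustrates the idea on a toy weight matrix; `toy_weights` and `rank` are hypothetical names. Additive scoring fits TF-IDF and BM25 weights; for the unigram LM the per-term probabilities would have to be multiplied (or their logarithms summed) instead:

```python
import pandas as pd

# Hypothetical toy weight matrix (terms x documents), standing in for tfidf or BIM25.
toy_weights = pd.DataFrame(
    {"D1": [1.2, 0.3, 0.0], "D2": [0.0, 0.3, 0.9], "D3": [0.4, 0.0, 0.0]},
    index=["cancer", "risk", "diet"],
)

def rank(query_terms, weights, top_k=10):
    """Sum the weights of the query terms per document and return the top_k documents."""
    known_terms = [t for t in query_terms if t in weights.index]  # skip out-of-vocabulary terms
    scores = weights.loc[known_terms].sum(axis=0)
    return scores.sort_values(ascending=False).head(top_k)

print(rank(["cancer", "risk", "unknownterm"], toy_weights))
```

Because the vocabulary is the row index, selecting the query terms is a plain `.loc` look-up and no transposition is needed.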

### Sample single term queries

Let's run the same single-term query, "cancer", against all three retrieval models and compare the results.

In [203]:
#TFIDF
top_tfidf= tfidf.loc['cancer'].sort_values(ascending=False).head(10) # after transposing, you can select the term directly as a column: tfidf.transpose().cancer
top_tfidf

MED-3717    1.897206
MED-1414    1.897206
MED-3729    1.706995
MED-4534    1.594690
MED-4465    1.594690
MED-2067    1.594690
MED-2577    1.594690
MED-2435    1.519069
MED-4470    1.519069
MED-2439    1.519069
Name: cancer, dtype: float64

In [199]:
# Unigram LM
top_unigram= unigram_LM.loc['cancer'].sort_values(ascending=False).head(10) # not named b, to avoid overwriting the BM25 parameter b
top_unigram

MED-3703    0.081339
MED-2137    0.061650
MED-2174    0.057772
MED-4391    0.052210
MED-890     0.048745
MED-5184    0.048281
MED-3551    0.047909
MED-3555    0.047347
MED-2258    0.045462
MED-3699    0.044962
Name: cancer, dtype: float64

In [202]:
# BM25
top_bim25= BIM25.loc['cancer'].sort_values(ascending=False).head(10)
top_bim25

MED-3703    0.917499
MED-1721    0.895649
MED-2760    0.894057
MED-3699    0.892689
MED-3555    0.884759
MED-14      0.881640
MED-4928    0.880848
MED-5353    0.877033
MED-4050    0.876367
MED-4785    0.876234
Name: cancer, dtype: float64

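There is very little overlap in the top 10 retrieved documents. TF-IDF shares no document with the other two models, while the two probabilistic ranking models share three documents (MED-3703, MED-3555 and MED-3699) and agree only on the top-ranked one, MED-3703.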
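
The overlap can be checked mechanically with set intersections. The document ids below are copied from the three result listings above; the variable names and the helper `overlap` are hypothetical:

```python
# Top-10 document ids copied from the three result listings above.
tfidf_ids = ["MED-3717", "MED-1414", "MED-3729", "MED-4534", "MED-4465",
             "MED-2067", "MED-2577", "MED-2435", "MED-4470", "MED-2439"]
unigram_ids = ["MED-3703", "MED-2137", "MED-2174", "MED-4391", "MED-890",
               "MED-5184", "MED-3551", "MED-3555", "MED-2258", "MED-3699"]
bim25_ids = ["MED-3703", "MED-1721", "MED-2760", "MED-3699", "MED-3555",
             "MED-14", "MED-4928", "MED-5353", "MED-4050", "MED-4785"]

def overlap(first, second):
    """Return the document ids that occur in both rankings, sorted alphabetically."""
    return sorted(set(first) & set(second))

print("TF-IDF vs unigram LM:", overlap(tfidf_ids, unigram_ids))
print("TF-IDF vs BM25:      ", overlap(tfidf_ids, bim25_ids))
print("unigram LM vs BM25:  ", overlap(unigram_ids, bim25_ids))
```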