# Putting it all together

## 1: Creating the collection and vocabulary

Import *collection_vocabulary.py* and create an instance of the collection class. Its attributes are the collection itself, the vocabulary, and some descriptive summary statistics (e.g. vocabulary size, collection size, collection length).

In [1]:
from collection_vocabulary import Collection
col=Collection()

In [31]:
import os
filename='./doc_term_mat.h5'
doc_term_matrix = []
if not os.path.isfile(filename):
    for doc in col.collection:
        tf_vector =[]
        for word in col.vocabulary:
            n= col.collection[doc].count(word)
            tf_vector.append(n)
        doc_term_matrix.append(tf_vector)

## 2: Creating the document term matrix
The document-term matrix is built as a list of lists from the collection created in step 1. It is then converted to a pandas DataFrame and stored on disk to facilitate debugging and further experimentation.

In [33]:
import pandas as pd
import numpy as np
doc_term_matrix= pd.DataFrame(data=doc_term_matrix,index= col.collection.keys(),columns=col.vocabulary)
hdf = pd.HDFStore('storage.h5') # create (or open) an HDF5 file in append mode
hdf['doc_term_matrix'] = doc_term_matrix # hdf.put('doc_term_matrix', doc_term_matrix) is an equivalent method
doc_term_matrix = hdf['doc_term_matrix'] # hdf.get('doc_term_matrix') is an equivalent method to retrieve the dataframe
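
The nested loop used to build the matrix calls `count` once per vocabulary term and document, which gets slow for a large vocabulary. A minimal sketch of an alternative that counts each document only once is shown below. It assumes that `col.collection` maps document ids to lists of tokens; the names `collection`, `vocabulary` and `toy_matrix` are hypothetical stand-ins, not part of *collection_vocabulary.py*.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy data standing in for col.collection and col.vocabulary
collection = {
    "D1": ["cancer", "risk", "cancer"],
    "D2": ["diet", "risk"],
}
vocabulary = sorted({tok for toks in collection.values() for tok in toks})

# Count the tokens of each document once, then look up every vocabulary term
rows = []
for doc_id, tokens in collection.items():
    counts = Counter(tokens)  # missing terms return 0
    rows.append([counts[term] for term in vocabulary])

toy_matrix = pd.DataFrame(rows, index=list(collection.keys()), columns=vocabulary)
print(toy_matrix)
```

The resulting frame has one row per document and one column per term, the same layout as `doc_term_matrix` above.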

This is a good point to start off the actual analysis and calculate different retrieval models.

## 3: TF-IDF

### IDF

Let's start with the easier part: calculating inverse document frequencies and storing them in a dataframe. The index of all dataframes produced in the remainder of this notebook will be the vocabulary ('the inverted index'), which allows applying the different retrieval models and querying the collection in a standardized way.

We will fix and use the parameters of the different models as discussed in the lecture. These may be subject to further hyperparameter tuning (bear in mind we are 'learning to rank').

In [38]:
inverted_index= doc_term_matrix.transpose()
def greater_than_zero(some_value): return some_value > 0
df=inverted_index.apply(greater_than_zero).sum(axis=1)
inverted_index.sum(axis=1).min() # sanity check: should be at least 1, otherwise a term occurs in no document and the idf divides by zero
def calculate_idf(some_value): return np.log10(col.collection_size/some_value)
idf=df.apply(calculate_idf)
hdf['idf']= idf 
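
To make the IDF step concrete, here is a small worked example on a toy inverted index. It is only a sketch: the frame `toy_index` and its values are made up, and `n_docs` plays the role of `col.collection_size`, which is assumed here to be the number of documents.

```python
import numpy as np
import pandas as pd

# Hypothetical inverted index: rows are terms, columns are documents
toy_index = pd.DataFrame(
    {"D1": [2, 0, 1], "D2": [0, 1, 1], "D3": [1, 0, 3], "D4": [0, 0, 1]},
    index=["cancer", "diet", "risk"],
)
n_docs = toy_index.shape[1]  # assumed equivalent of col.collection_size

# Document frequency: in how many documents does each term occur?
toy_df = (toy_index > 0).sum(axis=1)

# idf = log10(N / df); a term that occurs in every document gets idf 0
toy_idf = np.log10(n_docs / toy_df)
print(toy_idf)
```

Here 'risk' occurs in all four documents and therefore carries no weight, while 'diet' occurs in only one and gets the highest IDF.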

### TF

Raw term frequency is what we already have when we look column-wise at the *inverted_index* dataframe.
As discussed in the lecture, we normalize this frequency by the raw frequency of the most frequent term in each document. We also take the logarithm (any base will do), since we assume that relevance does not increase linearly with term frequency.

In [120]:
#TODO: double-check this with slides 14 and 19 of lecture 4
most_frequent_term=inverted_index.max(axis=0) 
tf= (1+ np.log10(inverted_index)).div(np.log10(1+ most_frequent_term),axis=1)
#tf= (inverted_index.div(most_frequent_term,axis=1))
tf.replace([np.inf, -np.inf], 0,inplace=True)
hdf['tf']= tf
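
The effect of the formula in the cell above, including the replacement of infinite values, can be checked on a toy example. This is a sketch with made-up counts; `toy_index`, `max_tf` and `toy_tf` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical inverted index: rows are terms, columns are documents
toy_index = pd.DataFrame(
    {"D1": [9, 0, 1], "D2": [0, 1, 1]},
    index=["cancer", "diet", "risk"],
)
max_tf = toy_index.max(axis=0)  # most frequent term per document: D1 -> 9, D2 -> 1

# log10(0) is -inf, so terms absent from a document come out as -inf here
with np.errstate(divide="ignore"):
    toy_tf = (1 + np.log10(toy_index)).div(np.log10(1 + max_tf), axis=1)

# Map the -inf entries of absent terms back to a weight of 0
toy_tf = toy_tf.replace([np.inf, -np.inf], 0)
print(toy_tf)
```

Note that the resulting weights are not bounded by 1: in D2, where the most frequent term occurs only once, every present term is divided by log10(2) and ends up with a weight above 3.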


### TFIDF
Bringing the pieces together.

In [171]:
tfidf= tf.multiply(idf, axis=0) # align idf with the term index (rows), so each term's row is scaled by its own idf
hdf['tfidf']= tfidf
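
When combining a term-by-document frame with a per-term Series, the axis along which pandas aligns the labels matters. The sketch below uses made-up values (`toy_tf`, `toy_idf` are hypothetical) to show the difference between aligning on the rows and the default alignment on the columns.

```python
import pandas as pd

# Hypothetical tf weights: rows are terms, columns are documents
toy_tf = pd.DataFrame(
    {"D1": [1.0, 0.0, 2.0], "D2": [0.0, 3.0, 1.0]},
    index=["cancer", "diet", "risk"],
)
# Hypothetical idf weights, one per term
toy_idf = pd.Series([0.5, 1.0, 0.0], index=["cancer", "diet", "risk"])

# axis=0 matches the Series index against the row labels (the terms)
toy_tfidf = toy_tf.multiply(toy_idf, axis=0)

# The default axis matches the Series index against the column labels
# (the documents); no labels overlap, so every cell becomes NaN
misaligned = toy_tf.multiply(toy_idf)

print(toy_tfidf)
print(misaligned.isna().all().all())
```

With `axis=0` each row is scaled by the IDF of its own term, which is what a TF-IDF weight requires.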

In [181]:
tfidf.head()

Unnamed: 0,MED-10,MED-14,MED-118,MED-301,MED-306,MED-329,MED-330,MED-332,MED-334,MED-335,...,MED-938,MED-939,MED-940,MED-892,MED-906,MED-917,MED-941,MED-942,MED-952,MED-961
'hort,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
+,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--a,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--all,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Unigram LM
To be added, structured the same way.

### BM25
To be added, structured the same way.

### Querying

@Philipp Naeser: please look at *query.py*, which already specifies in a class what a query should look like.
It may make sense to transpose the inverted index back, so that you can pick the tfidf values column-wise.

In [196]:
# an example query:
tf.loc['cancer'] # after transposing, you can select directly by the index term: tf.transpose().cancer

MED-10      1.564023
MED-14      1.624823
MED-118     0.000000
MED-301     0.000000
MED-306     0.000000
MED-329     0.000000
MED-330     0.000000
MED-332     0.000000
MED-334     0.000000
MED-335     0.000000
MED-398     0.000000
MED-557     0.000000
MED-666     0.000000
MED-691     0.000000
MED-692     0.000000
MED-702     1.249317
MED-706     0.000000
MED-707     0.000000
MED-708     0.000000
MED-709     0.000000
MED-711     0.000000
MED-712     1.418409
MED-713     0.000000
MED-714     0.000000
MED-716     0.000000
MED-717     0.000000
MED-718     0.000000
MED-719     0.000000
MED-720     0.000000
MED-721     0.000000
              ...   
MED-5363    0.000000
MED-5364    0.000000
MED-5365    0.000000
MED-5366    0.000000
MED-5367    0.000000
MED-5368    0.000000
MED-5369    0.000000
MED-5370    0.000000
MED-5371    0.000000
MED-847     0.000000
MED-873     1.302013
MED-874     1.895709
MED-875     0.000000
MED-905     0.000000
MED-914     0.000000
MED-915     0.000000
MED-924     0

In [201]:
tf.loc['cancer'].sort_values(ascending=False)

MED-3717    2.726833
MED-1414    2.726833
MED-3729    2.453445
MED-4534    2.292030
MED-4465    2.292030
MED-2067    2.292030
MED-2577    2.292030
MED-2435    2.183342
MED-4470    2.183342
MED-2439    2.183342
MED-4162    2.183342
MED-5184    2.183342
MED-1780    2.160964
MED-4464    2.160964
MED-5042    2.160964
MED-3748    2.160964
MED-1416    2.160964
MED-2087    2.160964
MED-2147    2.160964
MED-3172    2.113283
MED-4196    2.113283
MED-4825    2.113283
MED-4224    2.104077
MED-4433    2.104077
MED-3447    2.104077
MED-1599    2.104077
MED-4759    2.104077
MED-4752    2.104077
MED-3441    2.095903
MED-2578    2.058803
              ...   
MED-3714    0.000000
MED-3713    0.000000
MED-3712    0.000000
MED-3711    0.000000
MED-3710    0.000000
MED-3756    0.000000
MED-3757    0.000000
MED-3758    0.000000
MED-3775    0.000000
MED-3792    0.000000
MED-3788    0.000000
MED-3787    0.000000
MED-3786    0.000000
MED-3783    0.000000
MED-3780    0.000000
MED-3779    0.000000
MED-3778    0

So if you entered the single query term 'cancer', document 'MED-3717' would be ranked first, followed by 'MED-1414', ...
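
The single-term lookup above extends naturally to queries with several terms by summing the weights of the query terms per document. The following is only a sketch under assumptions: the function `rank` and the toy weights are hypothetical and do not reflect the interface defined in *query.py*.

```python
import pandas as pd

# Hypothetical tfidf weights: rows are terms, columns are documents
toy_tfidf = pd.DataFrame(
    {"D1": [0.5, 0.0, 0.2], "D2": [0.0, 3.0, 0.1], "D3": [1.0, 1.0, 0.0]},
    index=["cancer", "diet", "risk"],
)


def rank(query_terms, weights):
    """Return documents sorted by the summed weight of the query terms."""
    # Skip query terms that are not in the vocabulary
    known = [term for term in query_terms if term in weights.index]
    scores = weights.loc[known].sum(axis=0)
    return scores.sort_values(ascending=False)


ranking = rank(["cancer", "risk", "unknown"], toy_tfidf)
print(ranking)
```

Because every weighting scheme in this notebook is indexed by the vocabulary, the same function could be applied to `tf`, `tfidf`, or a later model without changes.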