# TFIDF, BM25, UnigramLM

## 1. Collection and Vocabulary

Import *collection_vocabulary.py* and create an instance of the collection class. Its attributes are the collection itself, the vocabulary, and some descriptive summary statistics (e.g. vocabulary size, collection size, collection length).

**Table of Contents**
1. Collection and Vocabulary
2. Document Term Matrix
3. Inverted Index
4. TFIDF
5. Unigram LM with J-M-Smoothing
6. BM25
7. Embeddings
8. Queries

We will fix and use the parameters of the different models as discussed in the lecture. 

In [15]:
from collection_vocabulary import Collection
col=Collection()

## 2. Document Term Matrix
The document term matrix is first built as a list of lists and then converted to a Pandas dataframe, which is stored to disk to facilitate debugging and further experimentation.

In [16]:
doc_term_matrix=[]
for doc in col.collection:
    tf_vector =[]
    for word in col.vocabulary:
        n= col.collection[doc].count(word)
        tf_vector.append(n)
    doc_term_matrix.append(tf_vector)
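
The nested loop above calls `count` once per vocabulary term and document. As a minimal sketch, the same counts can be obtained with one pass per document using `collections.Counter`. This assumes, as the loop suggests, that each document is a list of tokens; the toy `collection` and `vocabulary` below are made up for illustration and stand in for `col.collection` and `col.vocabulary`.

```python
from collections import Counter

# Hypothetical toy data standing in for col.collection and col.vocabulary
collection = {
    "D1": ["cancer", "risk", "cancer"],
    "D2": ["diet", "risk"],
}
vocabulary = sorted({tok for doc in collection.values() for tok in doc})

doc_term_rows = {}
for doc_id, tokens in collection.items():
    counts = Counter(tokens)  # one pass over the document
    # Counter returns 0 for terms that do not occur in the document
    doc_term_rows[doc_id] = [counts[term] for term in vocabulary]

print(vocabulary)
print(doc_term_rows)
```

The rows can be passed to `pd.DataFrame` in the same way as the list of lists above.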

In [17]:
import pandas as pd
import numpy as np
doc_term_matrix= pd.DataFrame(data=doc_term_matrix,index= col.collection.keys(),columns=col.vocabulary)
doc_term_matrix.to_pickle('doc_term_matrix.pkl')

In [18]:
doc_term_matrix.head(3) # this is what the doc term matrix looks like

Unnamed: 0,'hort,+,-,--a,--all,--have,--mainly,--of,--showed,--the,...,zooplankton,zoxazolamine,zr,zu,zuccarini,zucchini,zugesetztem,zusatzstoffe-online,zygote,zymography
MED-10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
MED-14,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
MED-118,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
# Sanity Check: should have dimensions 3633*29052
doc_term_matrix.shape

(3633, 29052)

In [20]:
# some summary stats for our project report and a sanity check that would reveal any empty docs
doc_term_matrix.sum(axis=1).describe()

count    3633.000000
mean      146.204789
std        53.995711
min        11.000000
25%       114.000000
50%       148.000000
75%       174.000000
max       939.000000
dtype: float64

## 3. Inverted Index

The inverted index is our unified (and in practice memory-efficient) way of representing the document term matrix that we will use in the remainder of this project.



This is a good point to start off the actual analysis and calculate different retrieval models.

In [21]:
inverted_index= doc_term_matrix.transpose()
inverted_index.to_pickle('inverted_index.pkl') # use later for embeddings, queries, ... 

In [22]:
# sanity check 1
# each term should occur at least once (implied by the way we construct the index), hence min>=1
inverted_index.sum(axis=1).min()

1
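
The transposed dataframe still stores every zero explicitly. The memory savings of an inverted index in practice come from keeping only the non-zero entries. A minimal sketch of such a postings-based index on made-up toy data (this is not the representation used in the rest of the notebook):

```python
# Hypothetical toy collection: document id -> list of tokens
collection = {
    "D1": ["cancer", "risk", "cancer"],
    "D2": ["diet", "risk"],
}

# term -> {document id: term frequency}; absent pairs are simply not stored
inverted = {}
for doc_id, tokens in collection.items():
    for tok in tokens:
        postings = inverted.setdefault(tok, {})
        postings[doc_id] = postings.get(doc_id, 0) + 1

# document frequency is the length of a term's postings
doc_freq = {term: len(postings) for term, postings in inverted.items()}

print(inverted)
print(doc_freq)
```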

## 4. TF-IDF

### IDF

In [29]:
df=(inverted_index>0).sum(axis=1)

In [30]:
raw_idf=(col.collection_size/df)

In [31]:
raw_idf.tail()

zucchini               3633.0
zugesetztem            3633.0
zusatzstoffe-online    3633.0
zygote                 3633.0
zymography             1816.5
dtype: float64

In [32]:
idf= np.log10(raw_idf) #aka log of raw_idf
idf.to_pickle('idf.pkl') #use the global idf scores for queries later
idf.tail()

zucchini               3.560265
zugesetztem            3.560265
zusatzstoffe-online    3.560265
zygote                 3.560265
zymography             3.259235
dtype: float64

In [33]:
# Sanity check: max raw idf score should be equal to number of docs in collection (a term occurring in exactly one doc)...
raw_idf.max().max()==3633

True

In [34]:
# Sanity check: ... and max idf score should be substantially lower
idf.max().max()

3.5602653978627146
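
To make the numbers above easier to verify by hand, here is a small worked sketch of the idf formula for single terms. It only uses the collection size 3633 reported above; the helper name `idf_score` is made up for illustration.

```python
import math

N_DOCS = 3633  # collection size reported above


def idf_score(doc_freq, n_docs=N_DOCS):
    """Log-scaled inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / doc_freq)


print(idf_score(1))  # a term that occurs in exactly one document
print(idf_score(2))  # a term that occurs in two documents
```

A term found in a single document reaches the maximum idf, and every doubling of the document frequency lowers the score by log10(2).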

### TF
Raw term frequency is what we obtain when we look columnwise at the *inverted_index* dataframe.
Since we assume that relevance does not increase linearly with term frequency, we dampen every non-zero count with a logarithm (1 + log tf; any logarithm will do the job). As discussed in the lecture, we then normalize by dividing by the equally dampened frequency of the most frequent term in each document.

In [35]:
# numerator part: 1 + log10(tf) for non-zero counts, zero counts stay 0
nominator=inverted_index.mask(inverted_index!=0,other=(np.log10(inverted_index)+1))

  


In [36]:
nominator.shape

(29052, 3633)

In [37]:
# denominator part
most_frequent_term=inverted_index.max(axis=0) # determine most frequent term in each doc
denominator= np.log10(most_frequent_term)
denominator+=1

In [38]:
denominator.shape

(3633,)

In [39]:
#sanity check, there shouldn't be any zeros
denominator.min()

1.0

In [40]:
#sanity check: absent terms keep a weight of 0, so the minimum should be exactly 0 (no negative values)
nominator.min().min()

0.0

In [41]:
#tf
tf=nominator.div(denominator, axis=1)
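
For a single (term, document) pair, the normalization computed by the cells above boils down to the following sketch. The function name `normalized_log_tf` is made up for illustration.

```python
import math


def normalized_log_tf(tf, max_tf):
    """(1 + log10(tf)) / (1 + log10(max_tf)), with 0 for absent terms."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) / (1 + math.log10(max_tf))


print(normalized_log_tf(0, 100))    # absent term
print(normalized_log_tf(10, 100))   # (1 + 1) / (1 + 2)
print(normalized_log_tf(100, 100))  # most frequent term in the document
```

The most frequent term of a document always receives a tf score of 1, and all other scores lie between 0 and 1.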

### TFIDF
Bringing the pieces together.

In [42]:
tfidf= tf.mul(idf, axis=0) # we multiply the tf scores in every doc with the corresponding idf scores
tfidf.to_pickle('tfidf.pkl')

In [43]:
tfidf.shape

(29052, 3633)

## 5. Unigram LM with Jelinek-Mercer Smoothing


### Global Language Model
We want to find out how likely each word is if we look at the whole corpus. 

In [24]:
global_LM=inverted_index.sum(axis=1)/col.collection_length # equal: inverted_index.sum(axis=1)/inverted_index.sum(axis=1).sum()

### Local Language Models
We want to obtain a language model for each document in the collection, therefore we look at the columns of the *inverted_index* dataframe. 

In [25]:
local_LMs=inverted_index/inverted_index.sum()

In [26]:
# Sanity Check: Probabilities should sum columnwise to 1, and adding all columns should yield the collection size (3633)
local_LMs.sum().sum()

3633.0

### Unigram LM with J-M-Smoothing
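Jelinek-Mercer smoothing linearly interpolates the local (document) LM and the global (collection) LM. As introduced in the lecture, we fix the interpolation weight at 0.5, so both models receive equal weight.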

In [27]:
unigram_LM= (local_LMs.apply(lambda x: x+ global_LM)).apply(lambda x: x/2)# same as multiplying both by 0.5 and adding them
unigram_LM.to_pickle('unigramLM.pkl')#writing Unigram-LM to disk

In [28]:
# sanity check: probabilities in every doc should sum up to one and all docs should sum up to 3633
unigram_LM.sum().sum() 

3633.0000000012674
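
The interpolation can also be written with an explicit weight. Below is a minimal sketch for a single document on made-up counts; `jm_smooth` and `lam` are hypothetical names, and `lam=0.5` reproduces the equal weighting used in this notebook.

```python
def jm_smooth(doc_counts, collection_counts, lam=0.5):
    """Jelinek-Mercer smoothing: lam * P(t|doc) + (1 - lam) * P(t|collection)."""
    doc_len = sum(doc_counts.values())
    coll_len = sum(collection_counts.values())
    return {
        term: lam * doc_counts.get(term, 0) / doc_len
        + (1 - lam) * coll_count / coll_len
        for term, coll_count in collection_counts.items()
    }


# Hypothetical toy counts
collection_counts = {"cancer": 3, "diet": 1, "risk": 4}
doc_counts = {"cancer": 2, "risk": 2}

smoothed = jm_smooth(doc_counts, collection_counts)
print(smoothed)
print(sum(smoothed.values()))
```

The term "diet" does not occur in the toy document but still receives a non-zero probability from the collection model, which is the point of smoothing.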

In [29]:
#sanity check: we don't want to have any negative values 
unigram_LM.min().min()<0

False

In [30]:
#sanity check: we don't want to have any missing values (NaN), e.g. from dividing by the length of an empty doc
unigram_LM.isnull().sum(axis=1).sum()==0 # we check that there are no NaN entries > True intended

True

In [31]:
'''
omitted: operating in log-space to avoid numerical instability
global_LM_log_space=np.log(global_LM)
local_LMs_log_space=local_LMs.applymap(lambda x: np.log(x, out=np.zeros_like(inverted_index.as_matrix),where=x!=0))

 '''

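
The omitted cell points at a standard trick: multiplying many small probabilities underflows, while summing their logarithms does not. A minimal sketch with made-up probabilities (the values are hypothetical and not taken from our models):

```python
import math

# Hypothetical smoothed probabilities P(t|d) for the terms of a short query
query_term_probs = [1e-5, 3e-7, 2e-6]

product_score = math.prod(query_term_probs)
log_score = sum(math.log(p) for p in query_term_probs)
print(product_score, log_score)

# With many query terms the plain product underflows to 0.0,
# whereas the sum of logs stays finite.
long_query_probs = [1e-5] * 400
print(math.prod(long_query_probs))
print(sum(math.log(p) for p in long_query_probs))
```

Because the logarithm is monotonic, ranking documents by the sum of log-probabilities gives the same order as ranking by the product, as long as the product has not underflowed.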

## 6. BM25
Let's approach BM25 step by step: we first model the BIM (binary independence model) and then gradually extend it.
We start from the naive assumption that we do not have any relevance feedback.


### BIM 
This simplification results in the following formula we want to compute:
w_t = log(0.5 * N / N_t)

N is the number of documents in the collection, and N_t signifies in how many documents a term appears. The ratio N/N_t is what we already calculated as the 'raw' inverse document frequency in the TF-IDF model above.
What we are basically doing is multiplying the raw inverse document frequency by 0.5 and then taking the logarithm.

Note: This can (and is intended to) produce negative values for words occurring in more than half of the documents.

In [32]:
BIM= np.log10(raw_idf*0.5) # raw idf calculated above in tfidf
BIM.head()

'hort    3.259235
+        2.782114
-        3.259235
--a      3.259235
--all    3.259235
dtype: float64

In [33]:
# observation: BIM weights may actually become negative - we have four negative weights
sum(BIM<0)

4
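
The sign of the BIM weight depends only on whether a term occurs in more than half of the documents. A small worked sketch, using the collection size 3633 from above and a made-up helper name `bim_weight`:

```python
import math


def bim_weight(n_docs, doc_freq):
    """BIM weight without relevance feedback: log10(0.5 * N / N_t)."""
    return math.log10(0.5 * n_docs / doc_freq)


print(bim_weight(3633, 1))     # rare term: large positive weight
print(bim_weight(3633, 1816))  # just under half of the docs: slightly positive
print(bim_weight(3633, 2000))  # more than half of the docs: negative
```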

### BM 25
Let's focus on the weighting part and then multiply these weights with the BIM weights from above.

In [34]:
# parameters as presented in the lecture
k=1.5
b=0.25
document_length= inverted_index.sum()
average_document_length= col.collection_length/col.collection_size # 146.20478943022295 TODO: include in project report
doc_len_div_by_avg_doc_len= document_length/average_document_length
#sanity check: should yield 3633 up to floating-point rounding; an exact == on floats can return False, np.isclose is more robust
doc_len_div_by_avg_doc_len.sum() == 3633

False

In [35]:
weighting_bim25_nominator= inverted_index*k*(k+1)
weighting_bim25_nominator.shape

(29052, 3633)

In [36]:
#the denominator is the tricky part since we have to add scalars and a vector to each column in the inverted index at the same time
weighting_bim25_denominator=inverted_index.add((doc_len_div_by_avg_doc_len*k*b), axis=1)+(k*(1-b))
weighting_bim25_denominator.shape

(29052, 3633)

In [37]:
#merging nominator and denominator
weighting_bim25= weighting_bim25_nominator.div(weighting_bim25_denominator)

In [38]:
#sanity check: 29052, 3633 ?
weighting_bim25.shape

(29052, 3633)

Combining the weights, and the vanilla BIM from above, we can now construct BIM25.

In [39]:
BIM25=weighting_bim25.mul(BIM, axis=0)
BIM25.to_pickle('BIM25.pkl')
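
For a single (term, document) pair, the weighting computed in the cells above can be sketched as a plain function. The function name `bm25_tf_weight` is made up; the formula mirrors the cells above, including the `k*(k+1)` factor in the numerator. Many textbook presentations use `tf*(k+1)` instead, which differs only by the constant factor `k` and therefore does not change the ranking.

```python
def bm25_tf_weight(tf, doc_len, avg_doc_len, k=1.5, b=0.25):
    """Term-frequency weighting as computed in the notebook cells above."""
    numerator = tf * k * (k + 1)
    denominator = tf + k * (1 - b) + k * b * doc_len / avg_doc_len
    return numerator / denominator


avg_len = 146.2  # roughly the average document length reported above

print(bm25_tf_weight(0, avg_len, avg_len))      # absent term
print(bm25_tf_weight(1, avg_len, avg_len))      # document of average length
print(bm25_tf_weight(1, 2 * avg_len, avg_len))  # longer document, lower weight
```

The final BM25 score of a term in a document is this weight multiplied by the BIM weight of the term.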

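## 7. Embeddings

Further information on the embeddings can be found in Word Embeddings.ipynb; here we only use the parts needed for feature generation.

We use two versions of embeddings here, both trained with gensim's FastText class: fasttext (with character n-grams) and fasttext.word2vec (trained with `word_ngrams=0`, i.e. without subword information).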

In [6]:
# Preprocessing
# Gensim requires a list of lists of Unicode (UTF-8) strings as an input. Since we have a small collection, 
# we are fine with loading everything into memory.
import re
doc_list= []
with open('./nfcorpus/raw/doc_dump.txt', 'r', encoding='utf-8') as rf1:
    for line in rf1:
        l = re.sub("MED-.*\t", "",line).lower().strip('\n').split()
        doc_list.append(l) 
len(doc_list)

5371
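
To see what the preprocessing does to a single line, here is a small sketch. The sample line is made up and only assumes the `<doc id><tab><text>` layout implied by the regular expression above.

```python
import re

# Hypothetical line in the "<doc id>\t<text>" layout of doc_dump.txt
sample_line = "MED-10\tStatin Use and Breast Cancer Survival\n"

# Drop the leading document id, lowercase, and split on whitespace
tokens = re.sub("MED-.*\t", "", sample_line).lower().strip("\n").split()
print(tokens)
```

Note that `.*` is greedy, so everything from `MED-` up to the last tab character in the line is removed.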

In [7]:
import gensim
gensim.models.fasttext.FAST_VERSION > -1 # make sure that you are using Cython backend

True

In [8]:
#Run this to create a fasttext model of our documents
#fasttext= gensim.models.FastText(bigram[doc_list], min_count= 1, min_n= 3, max_n=12)
fasttext= gensim.models.FastText(doc_list, min_count= 1, min_n= 3, max_n=12)
fasttext.save('our_fasttext')

In [9]:
#Same as above, run this to compute the model, or run next cell to load it (if it exists on disk already)
word2vec= gensim.models.FastText(doc_list, min_count= 1, word_ngrams=0)
word2vec.save('our_fasttextword2vec')

# 8. Querying

## Note: this first step is only necessary if you want a shortcut and have already computed the models in an earlier run

In [1]:
#To speed things up, load the pickle files. If you just ran the whole script, you do not need this step
#Note: pkl files are excluded from git for being too large, so you have to run the whole script once
import pandas as pd
import numpy as np
import gensim
tfidf = pd.read_pickle('tfidf.pkl')
BIM25 = pd.read_pickle('BIM25.pkl')
unigram_LM = pd.read_pickle('unigramLM.pkl')
idf = pd.read_pickle('idf.pkl')
from collection_vocabulary import Collection
col=Collection()
# this loads the whole model, (not only the vectors)
fasttext = gensim.models.FastText.load('our_fasttext')
word2vec = gensim.models.FastText.load('our_fasttextword2vec')
inverted_index = pd.read_pickle('inverted_index.pkl')



### Sample single term queries

Let's look at the same single-term query, "cancer", and compare the results of the three retrieval models.

In [40]:
#TFIDF
a= tfidf.loc['cancer'].sort_values(ascending=False).head(10) # if you transpose you can directly select by the index term  > tfidf.transpose().cancer
a

MED-3718    0.695754
MED-2322    0.695754
MED-5063    0.695754
MED-2137    0.695754
MED-4096    0.695754
MED-4643    0.695754
MED-4117    0.695754
MED-1414    0.695754
MED-3551    0.695754
MED-3550    0.695754
Name: cancer, dtype: float64

In [41]:
# Unigram LM
b= unigram_LM.loc['cancer'].sort_values(ascending=False).head(10)
b

MED-3703    0.081339
MED-2137    0.061650
MED-2174    0.057772
MED-4391    0.052210
MED-890     0.048745
MED-5184    0.048281
MED-3551    0.047909
MED-3555    0.047347
MED-2258    0.045462
MED-3699    0.044962
Name: cancer, dtype: float64

In [42]:
c= BIM25.loc['cancer'].sort_values(ascending=False).head(10)
c

MED-3703    0.917499
MED-1721    0.895649
MED-2760    0.894057
MED-3699    0.892689
MED-3555    0.884759
MED-14      0.881640
MED-4928    0.880848
MED-5353    0.877033
MED-4050    0.876367
MED-4785    0.876234
Name: cancer, dtype: float64

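There is very little overlap in the top 10 retrieved documents: TF-IDF and the unigram LM share two documents, the unigram LM and BM25 share three, and TF-IDF and BM25 share none. Only the two probabilistic ranking models agree on the top-ranked doc (MED-3703). Note also that the ten TF-IDF scores shown are tied, so their relative order carries no information.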
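
The overlap can be checked with a few set operations. The sketch below copies the document ids from the three outputs above; in the notebook itself the same can be done with `set(a.index) & set(b.index)`. The helper name `overlap` is made up.

```python
# Top-10 document ids copied from the three outputs above
tfidf_top = ["MED-3718", "MED-2322", "MED-5063", "MED-2137", "MED-4096",
             "MED-4643", "MED-4117", "MED-1414", "MED-3551", "MED-3550"]
unigram_top = ["MED-3703", "MED-2137", "MED-2174", "MED-4391", "MED-890",
               "MED-5184", "MED-3551", "MED-3555", "MED-2258", "MED-3699"]
bm25_top = ["MED-3703", "MED-1721", "MED-2760", "MED-3699", "MED-3555",
            "MED-14", "MED-4928", "MED-5353", "MED-4050", "MED-4785"]


def overlap(first, second):
    """Documents that appear in both top-10 lists."""
    return sorted(set(first) & set(second))


print(overlap(tfidf_top, unigram_top))
print(overlap(unigram_top, bm25_top))
print(overlap(tfidf_top, bm25_top))
```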

Now, let's get the query representations and compute the scores for each document

## Here the query part really starts

In [2]:
#Now to get the queries
train_queries = pd.read_csv('nfcorpus/train.all.queries', sep='\t', header=None)
train_queries.columns = ['id', 'text']
dev_queries = pd.read_csv('nfcorpus/dev.all.queries', sep='\t', header=None)
dev_queries.columns = ['id', 'text']
test_queries = pd.read_csv('nfcorpus/test.all.queries', sep='\t', header=None)
test_queries.columns = ['id', 'text']

#And the relevance scores given
train_rel = pd.read_csv('nfcorpus/train.3-2-1.qrel', sep='\t', header=None)
print(train_rel.describe())
test_rel = pd.read_csv('nfcorpus/test.3-2-1.qrel', sep='\t', header=None)
print(test_rel.describe())
dev_rel = pd.read_csv('nfcorpus/dev.3-2-1.qrel', sep='\t', header=None)
#As we can see, column 1 is always 0, so drop it
train_rel = train_rel.drop([1], axis=1)
dev_rel = dev_rel.drop([1], axis=1)
test_rel = test_rel.drop([1], axis=1)
train_rel.columns = ['qid', 'docid', 'rel']
dev_rel.columns = ['qid', 'docid', 'rel']
test_rel.columns = ['qid', 'docid', 'rel']

#The corpus also divides documents into train, dev and test, so we need to stick to that as well
#(in order to get comparable results)
train_docs = pd.read_csv('nfcorpus/train.docs', sep='\t', header=None)
train_docs.columns = ['id', 'text']
dev_docs = pd.read_csv('nfcorpus/dev.docs', sep='\t', header=None)
dev_docs.columns = ['id', 'text']
test_docs = pd.read_csv('nfcorpus/test.docs', sep='\t', header=None)
test_docs.columns = ['id', 'text']

              1              3
count  139350.0  139350.000000
mean        0.0       1.824212
std         0.0       0.454204
min         0.0       1.000000
25%         0.0       2.000000
50%         0.0       2.000000
75%         0.0       2.000000
max         0.0       3.000000
             1             3
count  15820.0  15820.000000
mean       0.0      1.816056
std        0.0      0.472168
min        0.0      1.000000
25%        0.0      2.000000
50%        0.0      2.000000
75%        0.0      2.000000
max        0.0      3.000000


In [4]:
#you can skip this if you already did it once, just start loading the matrices from pkl files
def get_query_term_matrix(queries, col):
    query_term_matrix = []
    for query in queries.itertuples():
        tf_vector =[]
        for word in col.vocabulary:
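            n= query.text.count(word) # query.text is a string, so this counts substring matches (e.g. '-' inside hyphenated words), not whole tokens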
            tf_vector.append(n)
        query_term_matrix.append(tf_vector)
    return pd.DataFrame(data=query_term_matrix,index=queries.id,columns=col.vocabulary)
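
One detail worth keeping in mind: `str.count` counts substring matches, whereas `list.count` on a tokenized text counts whole tokens. A small sketch on a made-up query text shows the difference; whitespace tokenization is an assumption here and may differ from what `Collection` does.

```python
# Hypothetical query text
text = "low-fat diet and cancer risk in anticancer trials"
tokens = text.split()  # assumption: simple whitespace tokenization

print(text.count("cancer"))    # substring matches, also inside "anticancer"
print(tokens.count("cancer"))  # whole-token matches only
print(text.count("-"))         # hyphen inside "low-fat"
print(tokens.count("-"))       # no standalone hyphen token
```

This explains the large counts in the `-` column of the query term matrix shown below.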

In [5]:
#let's compute the term_matrix for our query texts
train_matrix = get_query_term_matrix(train_queries, col)
dev_matrix = get_query_term_matrix(dev_queries, col)
test_matrix = get_query_term_matrix(test_queries, col)

In [7]:
test_matrix.head()

id,'hort,+,-,--a,--all,--have,--mainly,--of,--showed,--the,...,zooplankton,zoxazolamine,zr,zu,zuccarini,zucchini,zugesetztem,zusatzstoffe-online,zygote,zymography
PLAIN-1008,0,0,4,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PLAIN-1018,0,0,11,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PLAIN-102,0,1,54,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PLAIN-1028,0,0,5,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PLAIN-1039,0,0,4,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
#Another speed up, save the matrices
train_matrix.to_pickle('queries/train_matrix.pkl')
dev_matrix.to_pickle('queries/dev_matrix.pkl')
test_matrix.to_pickle('queries/test_matrix.pkl')

## If you already ran this once, you can simply load the matrices instead of computing them

In [3]:
train_matrix = pd.read_pickle('queries/train_matrix.pkl')
dev_matrix = pd.read_pickle('queries/dev_matrix.pkl')
test_matrix = pd.read_pickle('queries/test_matrix.pkl')

In [9]:
# As seen in the tf idf computation of the documents, we will use an inverted matrix
train_matrix_inverted = train_matrix.transpose()
dev_matrix_inverted = dev_matrix.transpose()
test_matrix_inverted = test_matrix.transpose()

In [10]:
def compute_query_tfidf(inv_query_matrix, idf):
    #TF
    # nominator part
    nominator=inv_query_matrix.mask(inv_query_matrix!=0,other=(np.log10(inv_query_matrix)+1))
    # denominator part
    most_frequent_term=inv_query_matrix.max(axis=0) # determine most frequent term in each query
    denominator= np.log10(most_frequent_term)
    denominator+=1
    tf=nominator.div(denominator, axis=1)
    tfidf_query= tf.mul(idf, axis=0) # we multiply the tf scores in every query with the corresponding idf scores
    return tfidf_query

In [11]:
#Now, let's get the tfidf scores for each query; any RuntimeWarning about log10 of zero counts can be ignored
train_tfidf = compute_query_tfidf(train_matrix_inverted, idf)
dev_tfidf = compute_query_tfidf(dev_matrix_inverted, idf)
test_tfidf = compute_query_tfidf(test_matrix_inverted, idf)



In [12]:
#Save those as pkl as well, because that also takes quite a while to compute
train_tfidf.to_pickle('queries/train_tfidf.pkl')
dev_tfidf.to_pickle('queries/dev_tfidf.pkl')
test_tfidf.to_pickle('queries/test_tfidf.pkl')

## Again, load it to save time

In [4]:
#Or load them, if they already exist
train_tfidf = pd.read_pickle('queries/train_tfidf.pkl')
dev_tfidf = pd.read_pickle('queries/dev_tfidf.pkl')
test_tfidf = pd.read_pickle('queries/test_tfidf.pkl')

In [24]:
test_tfidf.head()

id,PLAIN-1008,PLAIN-1018,PLAIN-102,PLAIN-1028,PLAIN-1039,PLAIN-1050,PLAIN-1066,PLAIN-1078,PLAIN-1088,PLAIN-1098,...,PLAIN-91,PLAIN-913,PLAIN-924,PLAIN-934,PLAIN-946,PLAIN-956,PLAIN-966,PLAIN-977,PLAIN-987,PLAIN-997
'hort,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
+,0.0,0.0,0.750049,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-,2.529077,2.433363,2.366582,2.628729,2.621103,2.5047,2.604743,2.726138,2.44186,2.918655,...,1.971896,1.79211,2.682064,1.856777,2.744308,2.022176,2.561169,2.587771,2.434983,2.680533
--a,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
--all,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Prepare the embedding models:

In [5]:
fasttext_embeddings_list=[]
words_not_covered_in_fasttext=[]
for word in inverted_index.index:
    try:
        fasttext_embeddings_list.append(fasttext.wv.get_vector(word))
    except:
        words_not_covered_in_fasttext.append(word)
        fasttext_embeddings_list.append(np.zeros(100)) # for those 3 OOV we insert an array consisting of zeros
fasttext_embeddings=pd.Series(fasttext_embeddings_list,index=inverted_index.index)
fasttext_embeddings.head()

'hort    [0.727683, -0.649744, 0.607231, -0.566998, 0.4...
+        [-0.315303, -0.304781, 0.192003, 0.625403, 0.4...
-        [-0.414824, -0.482533, 0.302804, 0.319794, 0.4...
--a      [-0.0802856, -0.321893, 0.167339, 0.024268, 0....
--all    [0.520891, -0.499456, -0.229592, 0.0148075, 0....
dtype: object

In [6]:
#Word2Vec Embeddings, 100-d dense vector
word2vec_embeddings_list=[]
words_not_covered_in_word2vec=[]
for word in inverted_index.index:
    try:
        word2vec_embeddings_list.append(word2vec.wv.get_vector(word))
    except:
        words_not_covered_in_word2vec.append(word)
        word2vec_embeddings_list.append(np.zeros(100)) # for those 3 OOV we insert an array consisting of zeros
word2vec_embeddings=pd.Series(word2vec_embeddings_list,index=inverted_index.index)
word2vec_embeddings.head()

'hort    [-0.713965, -0.403524, 0.314923, -1.19988, -0....
+        [-0.605538, -0.46913, -0.150712, 1.10127, -0.1...
-        [-0.904962, -0.500117, -0.138728, 1.11224, -0....
--a      [-0.277904, -0.0720914, 0.105588, 0.0845153, -...
--all    [0.328376, -0.181311, -0.67753, -1.00159, 0.55...
dtype: object

In [None]:
def get_weighted_embeddings(embeddings, tfidf_embed):
    sum_of_tfidf_weights=tfidf_embed.sum(axis=0)#vector containing the normalizing constant for each doc
    weighted_embeddings=tfidf_embed.mask(tfidf_embed!=0, other=(tfidf_embed*embeddings).div(sum_of_tfidf_weights))
    print('done')
    return weighted_embeddings
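
The idea behind the weighted embeddings is a TF-IDF-weighted average of the word vectors of a document or query. As a minimal sketch of that idea for a single document, with made-up two-dimensional vectors and a hypothetical helper `weighted_average_embedding` (this is one possible reading of the function above, not a drop-in replacement):

```python
def weighted_average_embedding(weights, embeddings):
    """Average the term vectors of one document, weighted by their tf-idf scores.

    weights:    term -> tf-idf weight of the term in the document
    embeddings: term -> embedding vector (list of floats)
    """
    dims = len(next(iter(embeddings.values())))
    total = sum(weights.values())  # normalizing constant for this document
    if total == 0:
        return [0.0] * dims  # document without any weighted term
    average = [0.0] * dims
    for term, weight in weights.items():
        for i, value in enumerate(embeddings[term]):
            average[i] += weight * value / total
    return average


# Hypothetical toy vectors and weights
embeddings = {"cancer": [1.0, 0.0], "diet": [0.0, 1.0]}
doc_weights = {"cancer": 3.0, "diet": 1.0}

print(weighted_average_embedding(doc_weights, embeddings))
```

Each document ends up with one dense vector, pulled towards the vectors of its highest-weighted terms.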

In [None]:
documents_fasttext = get_weighted_embeddings(fasttext_embeddings, tfidf)
train_queries_fasttext = get_weighted_embeddings(fasttext_embeddings, train_tfidf)
dev_queries_fasttext = get_weighted_embeddings(fasttext_embeddings, dev_tfidf)
test_queries_fasttext = get_weighted_embeddings(fasttext_embeddings, test_tfidf)
#Let's save those again, as computing them might take a while
documents_fasttext.to_pickle('documents_fasttext.pkl')
train_queries_fasttext.to_pickle('queries/train_queries_fasttext.pkl')
dev_queries_fasttext.to_pickle('queries/dev_queries_fasttext.pkl')
test_queries_fasttext.to_pickle('queries/test_queries_fasttext.pkl')

In [None]:
documents_word2vec= get_weighted_embeddings(word2vec_embeddings, tfidf)
train_queries_word2vec = get_weighted_embeddings(word2vec_embeddings, train_tfidf)
dev_queries_word2vec = get_weighted_embeddings(word2vec_embeddings, dev_tfidf)
test_queries_word2vec = get_weighted_embeddings(word2vec_embeddings, test_tfidf)
#Save them as well
documents_word2vec.to_pickle('documents_word2vec.pkl')
train_queries_word2vec.to_pickle('queries/train_queries_word2vec.pkl')
dev_queries_word2vec.to_pickle('queries/dev_queries_word2vec.pkl')
test_queries_word2vec.to_pickle('queries/test_queries_word2vec.pkl')

## Load the weighted embeddings, if you precomputed them

In [None]:
documents_fasttext = pd.read_pickle('documents_fasttext.pkl')
train_queries_fasttext = pd.read_pickle('queries/train_queries_fasttext.pkl')
dev_queries_fasttext = pd.read_pickle('queries/dev_queries_fasttext.pkl')
test_queries_fasttext = pd.read_pickle('queries/test_queries_fasttext.pkl')
documents_word2vec = pd.read_pickle('documents_word2vec.pkl')
train_queries_word2vec = pd.read_pickle('queries/train_queries_word2vec.pkl')
dev_queries_word2vec = pd.read_pickle('queries/dev_queries_word2vec.pkl')
test_queries_word2vec = pd.read_pickle('queries/test_queries_word2vec.pkl')

In [12]:
#this function assumes that you either ran the whole code or ran the shortcut step and the code from there on
def compute_scores(queries, documents, rel, queries_tfidf):#, queries_fasttext, queries_word2vec):
    #Get the documents defined in the nfcorpus
    doc_keys = documents.id
    tfidf_part = tfidf.loc[:, doc_keys]
    BIM25_part = BIM25.loc[:, doc_keys]
    unigram_LM_part = unigram_LM.loc[:, doc_keys]
    #Get the cosine between queries and docs (much faster than inside the loop)
    cosine = cosine_similarity(queries_tfidf, tfidf_part.transpose())
    list_of_df = []
    query_keys = queries['id']
    print('Computing', len(query_keys), 'queries on', len(doc_keys), 'documents')
    i = 0
    for key in query_keys:
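        # note: str() of the selected Series also contains its index label and the "Name: text, dtype: object" footer
        text = str(queries.loc[queries['id'] == key].text)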
        tfidf_scores = tfidf_part.loc[text.split()].sum()
        bim25_scores = BIM25_part.loc[text.split()].sum()
        unigram_scores = unigram_LM_part.loc[text.split()].product()
        cosine_scores = cosine[i]
        total = pd.DataFrame()
        total['tfidf'] = tfidf_scores
        total['bim25'] = bim25_scores
        total['unigram'] = unigram_scores
        total['cosine'] = cosine_scores
        total['qid'] = key.replace('PLAIN-', '')
        #Rel only contains 1, 2 and 3, everything that is not in there is set to 0
        total['rel'] = 0
        rel_temp = rel.loc[(rel['qid'] == key)]
        for row in rel_temp.itertuples():
            total.at[row.docid, 'rel'] = row.rel
        total.set_index(np.arange(len(doc_keys)))
        total.rename(columns={'': 'docid'}, inplace=True)
        list_of_df.append(total)
        i+=1
        if (i%100 == 0):
            print(i, 'queries computed')
    scores = pd.concat(list_of_df)
    print(i, 'queries computed')
    #Sanity check: should be same
    print(len(scores))
    print(len(doc_keys)*len(query_keys))
    return scores

In [13]:
from scipy.spatial import distance
def cosine_similarity(query, docs):
    cos_similarity = 1-distance.cdist(query, docs, metric='cosine')
    return cos_similarity
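
For intuition, the cosine similarity of a single pair of vectors can be sketched without any library. The helper `cosine` below is made up for illustration and assumes non-zero vectors; the notebook itself computes all query-document pairs at once via `cdist`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal, no shared terms
print(cosine([1.0, 1.0], [1.0, 0.0]))  # partial overlap
```

Since TF-IDF vectors are non-negative, the similarity between a query and a document always lies between 0 and 1.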

In [14]:
train_scores = compute_scores(train_queries, train_docs, train_rel, train_tfidf.transpose())
dev_scores = compute_scores(dev_queries, dev_docs, dev_rel, dev_tfidf.transpose())
test_scores = compute_scores(test_queries, test_docs, test_rel, test_tfidf.transpose())

Computing 2594 queries on 3612 documents
100 queries computed
200 queries computed
300 queries computed
400 queries computed
500 queries computed
600 queries computed
700 queries computed
800 queries computed
900 queries computed
1000 queries computed
1100 queries computed
1200 queries computed
1300 queries computed
1400 queries computed
1500 queries computed
1600 queries computed
1700 queries computed
1800 queries computed
1900 queries computed
2000 queries computed
2100 queries computed
2200 queries computed
2300 queries computed
2400 queries computed
2500 queries computed
2594 queries computed
9369528
9369528
Computing 325 queries on 3193 documents
100 queries computed
200 queries computed
300 queries computed
325 queries computed
1037725
1037725
Computing 325 queries on 3162 documents
100 queries computed
200 queries computed
300 queries computed
325 queries computed
1027650
1027650


In [15]:
train_scores

Unnamed: 0,tfidf,bim25,unigram,cosine,qid,rel
MED-10,0.000000,0.000000,2.314569e-23,0.015284,10,0
MED-14,0.000000,0.000000,2.314569e-23,0.010135,10,0
MED-118,0.000000,0.000000,2.314569e-23,0.024806,10,0
MED-301,0.000000,0.000000,2.314569e-23,0.019779,10,0
MED-306,0.000000,0.000000,2.314569e-23,0.030273,10,0
MED-329,0.727490,0.931327,1.404000e-22,0.016845,10,0
MED-330,0.000000,0.000000,2.314569e-23,0.014549,10,0
MED-332,0.000000,0.000000,2.314569e-23,0.027112,10,0
MED-334,0.000000,0.000000,2.314569e-23,0.022925,10,0
MED-335,0.000000,0.000000,2.314569e-23,0.020251,10,0


In [16]:
#Here you can see that rel is not always 0
test_scores.describe()

Unnamed: 0,tfidf,bim25,unigram,cosine,rel
count,1027650.0,1027650.0,1027650.0,1027650.0,1027650.0
mean,0.1121619,0.1651769,2.65621e-08,0.01158282,0.02795699
std,0.3264698,0.5390701,8.205315e-06,0.01065434,0.2311314
min,0.0,0.0,1.4853279999999998e-44,0.0,0.0
25%,0.0,0.0,5.475276e-35,0.002505088,0.0
50%,0.0,0.0,2.436599e-31,0.009550024,0.0
75%,0.0,0.0,1.934189e-25,0.01772193,0.0
max,10.7151,19.84373,0.006253765,0.1456515,3.0


For the full task: use the generated scores to train and evaluate pointwise and pairwise models

In [17]:
train_scores.to_pickle('queries/train_scores.pkl')
dev_scores.to_pickle('queries/dev_scores.pkl')
test_scores.to_pickle('queries/test_scores.pkl')

For the reduced task: create files according to the RankLib documentation: https://sourceforge.net/p/lemur/wiki/RankLib%20File%20Format/

In [18]:
#Create csv for Ranklib, code taken from answer here: https://stackoverflow.com/questions/37439533/pandas-custom-file-format
feature_columns = ['tfidf','bim25','unigram','cosine']
cols2id = {col:str(i+1) for i,col in enumerate(feature_columns)}

def f(x):
    if x.name in feature_columns:
        return cols2id[x.name] + ':' + x.astype(str)
    elif x.name == 'qid':
        return 'qid:' + x.astype(str)
    else:
        return x

(train_scores.apply(lambda x: f(x))[['rel','qid'] + feature_columns]
  .to_csv('reduced_task/train.csv', sep=' ', index=False, header=None)
)
(dev_scores.apply(lambda x: f(x))[['rel','qid'] + feature_columns]
  .to_csv('reduced_task/dev.csv', sep=' ', index=False, header=None)
)
(test_scores.apply(lambda x: f(x))[['rel','qid'] + feature_columns]
  .to_csv('reduced_task/test.csv', sep=' ', index=False, header=None)
)
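
Each line of the resulting files follows the RankLib layout `<relevance> qid:<query id> 1:<value> 2:<value> ...`. A minimal sketch that formats a single row; the helper name `ranklib_line` is made up, and the values are taken from the MED-329 row of `train_scores` shown above.

```python
def ranklib_line(rel, qid, features):
    """Format one query-document pair in the RankLib file format."""
    # feature ids start at 1 and follow the order of feature_columns
    feats = " ".join(f"{i}:{value}" for i, value in enumerate(features, start=1))
    return f"{rel} qid:{qid} {feats}"


# tfidf, bim25, unigram, cosine for query 10 and document MED-329
line = ranklib_line(0, "10", [0.72749, 0.931327, 1.404e-22, 0.016845])
print(line)
```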

In [22]:
#How many unique relevant documents do we have (or: why does the raw dump contain 5371 docs, and train/dev/test together only 3633)

docids_train = train_rel.docid
docids_dev = dev_rel.docid
docids_test = test_rel.docid
docids = pd.concat([docids_train, docids_dev, docids_test])
len(docids.unique())

3633