### Problem statement

Develop a ranked retrieval system that takes free-text queries from the user, ranks the documents by relevance, and returns the five most relevant documents (i.e., the IDs of the top 5 documents).

### Dataset
A subset of the Cranfield collection, published in the 1960s. It consists of 5 files:

- documents.csv: abstracts of 387 aerodynamics journal articles
- queries.csv: 85 free-text queries for training
- qrel.csv: 5 relevant documents for every query in the training set
- queries_val.csv: 22 free-text queries for validation
- qrel_val.csv: 5 relevant documents for every query in the validation set



### Steps to build a ranked retrieval system

- Load the dataset
- Text pre-processing
- Ranking documents and evaluating the rankings with Mean Average Precision (MAP)
- We will use the following ranking methods:
    - Jaccard Coefficient
    - Term Frequency
    - Inverse Document Frequency
    - TF-IDF
    - TF-IDF based Vector space model

In [93]:
# Importing libraries
import numpy as np
import pandas as pd
import re
import spacy

#loading model
nlp= spacy.load('en_core_web_sm')

### Loading Dataset

In [94]:
#Documents
documents= pd.read_csv('dataset/documents.csv')
print("Shape :", documents.shape)
documents.head()

Shape : (387, 5)


Unnamed: 0,docid,author,bibliography,body,title
0,2,ting-yili,"department of aeronautical engineering, rensse...",simple shear flow past a flat plate in an inco...,simple shear flow past a flat plate in an inco...
1,3,m. b. glauert,"department of mathematics, university of manch...",the boundary layer in simple shear flow past a...,the boundary layer in simple shear flow past a...
2,5,"wasserman,b.","j. ae. scs. 24, 1957, 924.",one-dimensional transient heat conduction into...,one-dimensional transient heat conduction into...
3,6,"campbell,w.f.","j. ae. scs. 25, 1958, 340.",one-dimensional transient heat flow in a multi...,one-dimensional transient heat flow in a multi...
4,12,"bisplinghoff,r.l.","j. ae. scs. 23, 1956, 289.",some structural and aerelastic considerations ...,some structural and aerelastic considerations ...


In [95]:
#Queries- Queries with query ID
queries= pd.read_csv('dataset/queries.csv')
print("Shape :", queries.shape)
queries.head()

Shape : (85, 2)


Unnamed: 0,qid,query
0,1,what similarity laws must be obeyed when const...
1,2,what are the structural and aeroelastic proble...
2,3,what problems of heat conduction in composite ...
3,8,what methods -dash exact or approximate -dash ...
4,10,are real-gas transport properties for air avai...


In [96]:
#Qrel- gives relevant document list
qrel= pd.read_csv('dataset/qrel.csv')
print("Shape :", qrel.shape)
qrel.head()

Shape : (425, 2)


Unnamed: 0,qid,docid
0,1,184
1,1,29
2,1,31
3,1,57
4,1,378


In [97]:
#queries_val: Validation queries

queries_val= pd.read_csv('dataset/queries_val.csv')
print("Shape :", queries_val.shape)
queries_val.head()

Shape : (22, 2)


Unnamed: 0,qid,query
0,189,is there a design method for calculating therm...
1,190,will an analysis of panel flutter based on arb...
2,191,"what is the criterion for true panel flutter, ..."
3,194,how can the analytical solution of the bucklin...
4,196,the problem of similarity for representative i...


In [98]:
#qrel_val: List of relevant docs for validation queries

qrel_val= pd.read_csv('dataset/qrel_val.csv')
print("Shape :", qrel_val.shape)
qrel_val.head()

Shape : (110, 2)


Unnamed: 0,qid,docid
0,189,395
1,189,866
2,189,869
3,189,865
4,189,868


### Loading sample queries and documents

In [99]:
#Reading sample queries
queries['query'].sample(10).values

array(['what progress has been made in research on unsteady aerodynamics .',
       'how far around a cylinder and under what conditions of flow, if any, is the velocity just outside of the boundary layer a linear function of the distance around the cylinder .',
       'can series expansions be found for the boundary layer on a flat plate in a shear flow .',
       'what is the combined effect of surface heat and mass transfer on hypersonic flow .',
       'does a practical flow follow the theoretical concepts for the interaction between adjacent blade rows of a supersonic cascade .',
       'why do users of orthodox pitot-static tubes often find that the calibrations appear to be,. - (a) significantly different from those formerly specified,  (b) wildly variable at low reynolds numbers .',
       'are previous analyses of circumferential thermal buckling of circular cylindrical shells unnecessarily involved or even inaccurate due to the assumed forms of buckling mode .',
       'are t

In [100]:
#Reading sample documents

documents['body'][:5].values

array(["simple shear flow past a flat plate in an incompressible fluid of small viscosity . in the study of high-speed viscous flow past a two-dimensional body it is usually necessary to consider a curved shock wave emitting from the nose or leading edge of the body .  consequently, there exists an inviscid rotational flow region between the shock wave and the boundary layer .  such a situation arises, for instance, in the study of the hypersonic viscous flow past a flat plate .  the situation is somewhat different from prandtl's classical boundary-layer problem . in prandtl's original problem the inviscid free stream outside the boundary layer is irrotational while in a hypersonic boundary-layer problem the inviscid free stream must be considered as rotational .  the possible effects of vorticity have been recently discussed by ferri and libby .  in the present paper, the simple shear flow past a flat plate in a fluid of small viscosity is investigated .  it can be shown that this pro

### Text pre-processing

The text already appears to be in lower case, but it still contains hyphens, punctuation, and extra spaces, which we can clean up with regular expressions.

In [101]:
def preprocess(text):
    #replace hyphens with spaces so that hyphenated words are split
    text= re.sub("-", " ", text)
    
    #keep only lowercase letters and spaces
    text= re.sub("[^a-z ]+", "", text)
    
    #collapse runs of whitespace into a single space (raw string avoids an invalid-escape warning)
    text= re.sub(r"\s+", " ", text)
    
    #creating doc object
    doc= nlp(text)
    
    # remove stopwords and lemmatize the remaining tokens
    tokens= [token.lemma_ for token in doc if not token.is_stop]
    
    return tokens
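
To see what the three regex substitutions do before spaCy is involved, the sketch below applies only the cleaning steps to a short string. The helper name `clean_text` is hypothetical and not part of the notebook; stopword removal and lemmatization are deliberately left out.

```python
import re

def clean_text(text):
    # Hypothetical helper: only the regex steps of preprocess(), no spaCy.
    text = re.sub("-", " ", text)        # split hyphenated words
    text = re.sub("[^a-z ]+", "", text)  # keep lowercase letters and spaces
    text = re.sub(r"\s+", " ", text)     # collapse repeated whitespace
    return text

cleaned = clean_text("one-dimensional  heat flow .")
print(repr(cleaned))
print(cleaned.split())
```

The cleaned string can still end with a single trailing space where punctuation was removed; tokenization takes care of that.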

In [102]:
#preprocessing on documents
documents['tokens']= documents['body'].apply(preprocess)
documents.head()

Unnamed: 0,docid,author,bibliography,body,title,tokens
0,2,ting-yili,"department of aeronautical engineering, rensse...",simple shear flow past a flat plate in an inco...,simple shear flow past a flat plate in an inco...,"[simple, shear, flow, past, flat, plate, incom..."
1,3,m. b. glauert,"department of mathematics, university of manch...",the boundary layer in simple shear flow past a...,the boundary layer in simple shear flow past a...,"[boundary, layer, simple, shear, flow, past, f..."
2,5,"wasserman,b.","j. ae. scs. 24, 1957, 924.",one-dimensional transient heat conduction into...,one-dimensional transient heat conduction into...,"[dimensional, transient, heat, conduction, dou..."
3,6,"campbell,w.f.","j. ae. scs. 25, 1958, 340.",one-dimensional transient heat flow in a multi...,one-dimensional transient heat flow in a multi...,"[dimensional, transient, heat, flow, multilaye..."
4,12,"bisplinghoff,r.l.","j. ae. scs. 23, 1956, 289.",some structural and aerelastic considerations ...,some structural and aerelastic considerations ...,"[structural, aerelastic, consideration, high, ..."


In [103]:
#preprocessing on queries
queries['tokens']= queries['query'].apply(preprocess)
queries.head()

Unnamed: 0,qid,query,tokens
0,1,what similarity laws must be obeyed when const...,"[similarity, law, obey, construct, aeroelastic..."
1,2,what are the structural and aeroelastic proble...,"[structural, aeroelastic, problem, associate, ..."
2,3,what problems of heat conduction in composite ...,"[problem, heat, conduction, composite, slab, s..."
3,8,what methods -dash exact or approximate -dash ...,"[method, dash, exact, approximate, dash, prese..."
4,10,are real-gas transport properties for air avai...,"[real, gas, transport, property, air, availabl..."


In [104]:
#pre-processing on queries for validation
queries_val['tokens']= queries_val['query'].apply(preprocess)
queries_val.head()

Unnamed: 0,qid,query,tokens
0,189,is there a design method for calculating therm...,"[design, method, calculate, thermal, fatigue, ..."
1,190,will an analysis of panel flutter based on arb...,"[analysis, panel, flutter, base, arbitrarily, ..."
2,191,"what is the criterion for true panel flutter, ...","[criterion, true, panel, flutter, oppose, smal..."
3,194,how can the analytical solution of the bucklin...,"[analytical, solution, buckle, strength, unifo..."
4,196,the problem of similarity for representative i...,"[problem, similarity, representative, investig..."


### Ranking documents and evaluation using MAP

#### Jaccard Coefficient

In [105]:
# Creating a temporary dataframe

temp_doc= documents[['docid', 'tokens']].copy()

In [106]:
def jaccard_coefficient(dtokens, qtokens):
    #calculating A intersection B
    numerator= len(set(dtokens).intersection(set(qtokens)))
    #calculating A union B
    denominator= len(set(dtokens).union(set(qtokens)))
    
    return numerator/denominator

In [107]:
# Calculating the jaccard coefficient for a sample query-document pair

jaccard_coefficient(temp_doc['tokens'][0], queries['tokens'][0])

0.02702702702702703
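
As a sanity check on the formula, here is a minimal sketch on two made-up token lists. The function name `jaccard` and the guard for an empty union are assumptions added for illustration; `jaccard_coefficient` above would raise a `ZeroDivisionError` if both token lists were empty.

```python
def jaccard(a_tokens, b_tokens):
    a, b = set(a_tokens), set(b_tokens)
    union = a | b
    # Assumption: score two empty token lists as 0 instead of dividing by zero.
    return len(a & b) / len(union) if union else 0.0

doc_tokens = ["heat", "transfer", "flat", "plate", "flow"]
query_tokens = ["heat", "flow", "cylinder"]

# 2 shared terms (heat, flow) out of 6 distinct terms overall
score = jaccard(doc_tokens, query_tokens)
print(score)
```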

In [108]:
# getting Jaccard coefficient for all the documents against a sample query

temp_doc['jaccard']= temp_doc['tokens'].apply(lambda x: jaccard_coefficient(x, queries['tokens'][0]))
temp_doc.head(10)

Unnamed: 0,docid,tokens,jaccard
0,2,"[simple, shear, flow, past, flat, plate, incom...",0.027027
1,3,"[boundary, layer, simple, shear, flow, past, f...",0.0
2,5,"[dimensional, transient, heat, conduction, dou...",0.028571
3,6,"[dimensional, transient, heat, flow, multilaye...",0.020408
4,12,"[structural, aerelastic, consideration, high, ...",0.084746
5,15,"[dimensional, panel, flutter, theory, experime...",0.0
6,16,"[transformation, compressible, turbulent, boun...",0.0
7,21,"[heat, transfer, slip, flow, number, author, c...",0.030303
8,23,"[skin, friction, heat, transfer, characteristi...",0.017857
9,24,"[theory, stagnation, point, heat, transfer, di...",0.032609


In [109]:
#DocIDs of the top 10 most relevant documents
temp_doc.sort_values(by='jaccard', ascending= False).head(10).reset_index(drop=True)

Unnamed: 0,docid,tokens,jaccard
0,12,"[structural, aerelastic, consideration, high, ...",0.084746
1,51,"[theory, aircraft, structural, model, subject,...",0.084746
2,378,"[engineering, relation, friction, heat, transf...",0.073171
3,670,"[blunt, body, heat, transfer, hypersonic, spee...",0.066667
4,875,"[model, aeroelastic, investigation, addendum, ...",0.066667
5,184,"[scale, model, thermo, aeroelastic, research, ...",0.057971
6,1111,"[research, high, speed, flutter, paper, presen...",0.057143
7,436,"[heat, transfer, planetary, atmosphere, super,...",0.055556
8,629,"[second, order, effect, laminar, boundary, lay...",0.055556
9,1305,"[propose, programme, wind, tunnel, test, hyper...",0.055556


In [110]:
#DocID of top 5 most relevant documents
temp_doc.sort_values(by='jaccard', ascending=False).head()['docid'].values

array([ 12,  51, 378, 670, 875])

In [111]:
# Function for ranking documents by Jaccard coefficient and returning the top 5 docids
def jaccard_rank(qtokens):
  # Find jaccard coefficient for all docs
  temp_doc['jaccard']=temp_doc['tokens'].apply(lambda x: jaccard_coefficient(x,qtokens))

  # Find top 5 most relevant docs
  relevant_docids=temp_doc.sort_values(by='jaccard',ascending=False).head()['docid'].values
  return relevant_docids

In [112]:
#Ranking documents according to jaccard coefficient
queries['jaccard_rel']= queries['tokens'].apply(lambda x: jaccard_rank(x))
queries.head()

Unnamed: 0,qid,query,tokens,jaccard_rel
0,1,what similarity laws must be obeyed when const...,"[similarity, law, obey, construct, aeroelastic...","[12, 51, 378, 670, 875]"
1,2,what are the structural and aeroelastic proble...,"[structural, aeroelastic, problem, associate, ...","[12, 51, 700, 746, 875]"
2,3,what problems of heat conduction in composite ...,"[problem, heat, conduction, composite, slab, s...","[5, 584, 6, 145, 582]"
3,8,what methods -dash exact or approximate -dash ...,"[method, dash, exact, approximate, dash, prese...","[122, 1306, 639, 655, 988]"
4,10,are real-gas transport properties for air avai...,"[real, gas, transport, property, air, availabl...","[405, 302, 436, 583, 616]"


In [113]:
# Adding ground truth in a column
queries['ground_truth']= queries['qid'].apply(lambda x:qrel[qrel['qid']==x]['docid'].values)
queries.head()

Unnamed: 0,qid,query,tokens,jaccard_rel,ground_truth
0,1,what similarity laws must be obeyed when const...,"[similarity, law, obey, construct, aeroelastic...","[12, 51, 378, 670, 875]","[184, 29, 31, 57, 378]"
1,2,what are the structural and aeroelastic proble...,"[structural, aeroelastic, problem, associate, ...","[12, 51, 700, 746, 875]","[12, 746, 15, 184, 858]"
2,3,what problems of heat conduction in composite ...,"[problem, heat, conduction, composite, slab, s...","[5, 584, 6, 145, 582]","[5, 6, 90, 91, 119]"
3,8,what methods -dash exact or approximate -dash ...,"[method, dash, exact, approximate, dash, prese...","[122, 1306, 639, 655, 988]","[48, 122, 354, 360, 1005]"
4,10,are real-gas transport properties for air avai...,"[real, gas, transport, property, air, availabl...","[405, 302, 436, 583, 616]","[259, 405, 302, 436, 437]"


In [114]:
def average_precision(model_rel, ground_truth):
    tp= 0
    precisions= []
    
    #finding precision at points at which relevant document is returned
    for index, value in enumerate(model_rel):
        if value in ground_truth:
            tp+=1
            precisions.append(tp/(index+1))
            
    #if no relevant document is in the list, return 0
    if not precisions:
        return 0
    
    return np.mean(precisions)

In [115]:
#Let's run the above on a sample
average_precision([5,6,1,2,3], [1,2,3,4,5])

0.8041666666666667
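
The sample value can be traced step by step. The sketch below (the helper name `precision_trace` is hypothetical) records the precision at each rank where a relevant document appears and then averages those values, which is the same computation `average_precision` performs.

```python
def precision_trace(ranked, relevant):
    relevant = set(relevant)
    hits, trace = 0, []
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            # precision at this rank = relevant documents so far / rank
            trace.append((rank, hits / rank))
    return trace

trace = precision_trace([5, 6, 1, 2, 3], [1, 2, 3, 4, 5])
for rank, p in trace:
    print(f"rank {rank}: precision {p:.4f}")

ap = sum(p for _, p in trace) / len(trace)
print("average precision:", ap)
```

Note that this variant averages only over the relevant documents that were actually retrieved, so a ranking with a single hit at rank 1 scores 1.0. A commonly used alternative definition divides by the total number of relevant documents instead, which would give lower scores here.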

In [116]:
#Finding average precision for each query
queries['jaccard_ap']= queries.apply(lambda x: average_precision(x['jaccard_rel'], x['ground_truth']), axis=1)
queries.head()

Unnamed: 0,qid,query,tokens,jaccard_rel,ground_truth,jaccard_ap
0,1,what similarity laws must be obeyed when const...,"[similarity, law, obey, construct, aeroelastic...","[12, 51, 378, 670, 875]","[184, 29, 31, 57, 378]",0.333333
1,2,what are the structural and aeroelastic proble...,"[structural, aeroelastic, problem, associate, ...","[12, 51, 700, 746, 875]","[12, 746, 15, 184, 858]",0.75
2,3,what problems of heat conduction in composite ...,"[problem, heat, conduction, composite, slab, s...","[5, 584, 6, 145, 582]","[5, 6, 90, 91, 119]",0.833333
3,8,what methods -dash exact or approximate -dash ...,"[method, dash, exact, approximate, dash, prese...","[122, 1306, 639, 655, 988]","[48, 122, 354, 360, 1005]",1.0
4,10,are real-gas transport properties for air avai...,"[real, gas, transport, property, air, availabl...","[405, 302, 436, 583, 616]","[259, 405, 302, 436, 437]",1.0


In [117]:
#Mean average precision
print('Mean Average Precision :', queries['jaccard_ap'].mean())

Mean Average Precision : 0.49555555555555564


#### Evaluation on validation set

In [118]:
# Adding ground truth in a column
queries_val['ground_truth']= queries_val['qid'].apply(lambda x:qrel_val[qrel_val['qid']==x]['docid'].values)

In [119]:
#Ranking documents according to jaccard coefficient
queries_val['jaccard_rel']= queries_val['tokens'].apply(lambda x: jaccard_rank(x))
queries_val.head()

Unnamed: 0,qid,query,tokens,ground_truth,jaccard_rel
0,189,is there a design method for calculating therm...,"[design, method, calculate, thermal, fatigue, ...","[395, 866, 869, 865, 868]","[868, 1306, 833, 906, 909]"
1,190,will an analysis of panel flutter based on arb...,"[analysis, panel, flutter, base, arbitrarily, ...","[15, 391, 285, 390, 864]","[390, 1008, 285, 21, 391]"
2,191,"what is the criterion for true panel flutter, ...","[criterion, true, panel, flutter, oppose, smal...","[914, 915, 285, 857, 858]","[285, 31, 864, 728, 15]"
3,194,how can the analytical solution of the bucklin...,"[analytical, solution, buckle, strength, unifo...","[739, 740, 742, 743, 744]","[932, 744, 1050, 1172, 1171]"
4,196,the problem of similarity for representative i...,"[problem, similarity, representative, investig...","[51, 185, 874, 875, 876]","[875, 1008, 184, 864, 655]"


In [120]:
#Finding average precision for each query
queries_val['jaccard_ap']= queries_val.apply(lambda x: average_precision(x['jaccard_rel'], x['ground_truth']), axis=1)
queries_val.head()

Unnamed: 0,qid,query,tokens,ground_truth,jaccard_rel,jaccard_ap
0,189,is there a design method for calculating therm...,"[design, method, calculate, thermal, fatigue, ...","[395, 866, 869, 865, 868]","[868, 1306, 833, 906, 909]",1.0
1,190,will an analysis of panel flutter based on arb...,"[analysis, panel, flutter, base, arbitrarily, ...","[15, 391, 285, 390, 864]","[390, 1008, 285, 21, 391]",0.755556
2,191,"what is the criterion for true panel flutter, ...","[criterion, true, panel, flutter, oppose, smal...","[914, 915, 285, 857, 858]","[285, 31, 864, 728, 15]",1.0
3,194,how can the analytical solution of the bucklin...,"[analytical, solution, buckle, strength, unifo...","[739, 740, 742, 743, 744]","[932, 744, 1050, 1172, 1171]",0.5
4,196,the problem of similarity for representative i...,"[problem, similarity, representative, investig...","[51, 185, 874, 875, 876]","[875, 1008, 184, 864, 655]",1.0


In [121]:
#Mean average precision
print('Mean Average Precision :', queries_val['jaccard_ap'].mean())

Mean Average Precision : 0.4431818181818181


#### Term Frequency

In [122]:
#Creating vocabulary set

vocabulary= set()

for i in documents['tokens'].values:
    vocabulary= vocabulary.union(set(i))
    
#sorting vocabulary alphabetically
vocabulary= sorted(vocabulary)

In [123]:
print("Size of vocabulary :", len(vocabulary))

Size of vocabulary : 3042


In [124]:
tf_list_doc= []

#Getting term frequencies
for tokens in documents['tokens']:
    #Initializing a dictionary with 0 frequency- keys as terms in vocabulary, value as 0.
    doc_dict= dict.fromkeys(vocabulary,0)
    
    #Counting term frequencies
    for term in tokens:
        doc_dict[term]+=1
        
    # Adding dictionary to list
    tf_list_doc.append(doc_dict)
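
The loop above builds one dense dictionary per document, with an entry for every vocabulary term. As a small illustration of the counting step alone, the sketch below uses `collections.Counter` on a made-up token list; this is an alternative shown for comparison, not what the notebook uses.

```python
from collections import Counter

tokens = ["flow", "plate", "flow", "shear", "flow", "plate"]

# Counter stores only the terms that occur; missing terms read as 0.
tf = Counter(tokens)
print(tf["flow"], tf["plate"], tf["viscosity"])
```

The dense dictionaries are convenient here because they turn directly into a DataFrame with one column per vocabulary term.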

In [125]:
len(tf_list_doc) 

#387 is the length of documents. For each document we get a dictionary with frequency mapping of each word in vocabulary

387

In [126]:
# Creating a dataframe of term frequencies for documents
documents_tf= pd.concat([documents['docid'], pd.DataFrame(tf_list_doc)], axis=1)
print('Shape :', documents_tf.shape)
documents_tf.head()

Shape : (387, 3043)


Unnamed: 0,docid,ab,abbreviate,ability,ablate,ablating,ablation,able,abrupt,abruptly,...,year,yield,york,young,z,zbrozek,zero,zeroth,zone,zuk
0,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,6,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,12,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [127]:
documents_tf.loc[0,documents_tf.loc[0,:]!=0][1:]

approximation    1
arise            1
body             2
boundary         5
classical        1
                ..
usually          1
viscosity        2
viscous          2
vorticity        2
wave             2
Name: 0, Length: 66, dtype: int64

In [128]:
# Getting term frequency for the first document in the dataset

print('Document ID: ', documents_tf['docid'][0])
documents_tf.loc[0, documents_tf.loc[0,:]!=0][1:].to_dict() #get all words with non-zero frequency

Document ID:  2


{'approximation': 1,
 'arise': 1,
 'body': 2,
 'boundary': 5,
 'classical': 1,
 'consequently': 1,
 'consider': 2,
 'constant': 1,
 'curved': 1,
 'different': 1,
 'dimensional': 2,
 'discuss': 1,
 'discussion': 1,
 'edge': 1,
 'effect': 1,
 'emit': 1,
 'exist': 1,
 'feature': 1,
 'ferri': 1,
 'flat': 3,
 'flow': 6,
 'fluid': 2,
 'free': 3,
 'high': 1,
 'hypersonic': 2,
 'incompressible': 2,
 'instance': 1,
 'investigate': 1,
 'inviscid': 3,
 'irrotational': 1,
 'layer': 5,
 'leading': 1,
 'libby': 1,
 'necessary': 1,
 'nose': 1,
 'novel': 1,
 'original': 1,
 'outside': 1,
 'paper': 1,
 'past': 4,
 'plate': 3,
 'possible': 1,
 'prandtls': 2,
 'present': 1,
 'problem': 4,
 'recently': 1,
 'region': 1,
 'restrict': 1,
 'rotational': 2,
 'shear': 2,
 'shock': 2,
 'show': 1,
 'simple': 2,
 'situation': 2,
 'small': 2,
 'somewhat': 1,
 'speed': 1,
 'steady': 1,
 'stream': 3,
 'study': 2,
 'treat': 1,
 'usually': 1,
 'viscosity': 2,
 'viscous': 2,
 'vorticity': 2,
 'wave': 2}

In [129]:
#Taking log normalized term frequency

def log_normalize(x):
    if x!=0:
        return 1+ np.log10(x)
    return 0
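
To get a feel for how strongly the log scaling dampens raw counts, the sketch below evaluates the same rule, 1 + log10(tf) for non-zero counts, on a few values. The name `log_tf` is hypothetical and the function uses the standard library instead of NumPy.

```python
import math

def log_tf(count):
    # Same rule as log_normalize above: 1 + log10(tf) for tf > 0, else 0.
    return 1 + math.log10(count) if count > 0 else 0.0

for count in (0, 1, 2, 5, 10, 100):
    print(count, round(log_tf(count), 4))
```

A term that occurs 100 times ends up with a weight only three times that of a term that occurs once.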

In [130]:
documents_tf.iloc[:,1:]= documents_tf.iloc[:, 1:].applymap(log_normalize)


In [131]:
# Getting log normalized term frequency for the first document in the dataset
print('Document ID: ', documents_tf['docid'][0])
documents_tf.loc[0, documents_tf.loc[0,:]!=0][1:].to_dict() #get all words with non-zero log frequency

Document ID:  2


{'approximation': 1.0,
 'arise': 1.0,
 'body': 1.3010299956639813,
 'boundary': 1.6989700043360187,
 'classical': 1.0,
 'consequently': 1.0,
 'consider': 1.3010299956639813,
 'constant': 1.0,
 'curved': 1.0,
 'different': 1.0,
 'dimensional': 1.3010299956639813,
 'discuss': 1.0,
 'discussion': 1.0,
 'edge': 1.0,
 'effect': 1.0,
 'emit': 1.0,
 'exist': 1.0,
 'feature': 1.0,
 'ferri': 1.0,
 'flat': 1.4771212547196624,
 'flow': 1.7781512503836436,
 'fluid': 1.3010299956639813,
 'free': 1.4771212547196624,
 'high': 1.0,
 'hypersonic': 1.3010299956639813,
 'incompressible': 1.3010299956639813,
 'instance': 1.0,
 'investigate': 1.0,
 'inviscid': 1.4771212547196624,
 'irrotational': 1.0,
 'layer': 1.6989700043360187,
 'leading': 1.0,
 'libby': 1.0,
 'necessary': 1.0,
 'nose': 1.0,
 'novel': 1.0,
 'original': 1.0,
 'outside': 1.0,
 'paper': 1.0,
 'past': 1.6020599913279625,
 'plate': 1.4771212547196624,
 'possible': 1.0,
 'prandtls': 1.3010299956639813,
 'present': 1.0,
 'problem': 1.602059991

In [132]:
### Ranking

qtokens= list(set(queries['tokens'][0]).intersection(vocabulary))
print('Tokens in Query: ', len(queries['tokens'][0]), queries['tokens'][0])
print('Tokens available in Vocabulary: ', len(qtokens), qtokens)

Tokens in Query:  10 ['similarity', 'law', 'obey', 'construct', 'aeroelastic', 'model', 'heat', 'high', 'speed', 'aircraft']
Tokens available in Vocabulary:  9 ['aircraft', 'law', 'similarity', 'construct', 'model', 'speed', 'heat', 'aeroelastic', 'high']


The issue here is that not every token in the query is present in the document vocabulary.

In [133]:
print("Tokens not in vocabulary: ", set(queries['tokens'][0]).difference(set(qtokens)))

Tokens not in vocabulary:  {'obey'}


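This can make the ranking step fail outright: selecting a column that does not exist in `documents_tf` raises a `KeyError` in current pandas versions. So, before doing anything else, we take the intersection of the query tokens and the document vocabulary.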
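
A minimal sketch of the failure and the fix, on a made-up two-document table (the column names and values are illustrative only, and the behaviour shown assumes a recent pandas version):

```python
import pandas as pd

# Toy term-frequency table; not taken from the dataset.
toy_tf = pd.DataFrame({"docid": [2, 3], "heat": [1.0, 0.0], "speed": [0.0, 1.3]})
query_tokens = ["heat", "obey", "speed"]

try:
    toy_tf.loc[:, ["docid"] + query_tokens]
    lookup_failed = False
except KeyError:
    # "obey" is not a column, so label-based selection fails.
    lookup_failed = True

# Keep only tokens that exist as columns; dict.fromkeys drops duplicates in order.
known = [t for t in dict.fromkeys(query_tokens) if t in toy_tf.columns]
toy_tf["tf_sum"] = toy_tf[known].sum(axis=1)
print(lookup_failed, known, toy_tf["tf_sum"].tolist())
```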

In [134]:
# Creating a list of columns to retrieve
columns= ['docid']
columns.extend(qtokens)
print(columns)

['docid', 'aircraft', 'law', 'similarity', 'construct', 'model', 'speed', 'heat', 'aeroelastic', 'high']


In [135]:
#Now extract term frequencies for the columns
temp_doc= documents_tf.loc[:,columns].copy()
temp_doc.head()

Unnamed: 0,docid,aircraft,law,similarity,construct,model,speed,heat,aeroelastic,high
0,2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
1,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,5,0.0,0.0,0.0,0.0,0.0,0.0,1.60206,0.0,0.0
3,6,0.0,0.0,0.0,0.0,0.0,0.0,1.477121,0.0,0.0
4,12,1.30103,0.0,0.0,0.0,0.0,1.60206,1.0,1.30103,1.60206


In [136]:
#Adding all the frequencies
temp_doc['tf_sum']= temp_doc[qtokens].sum(axis=1)
temp_doc.head()

Unnamed: 0,docid,aircraft,law,similarity,construct,model,speed,heat,aeroelastic,high,tf_sum
0,2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,2.0
1,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,5,0.0,0.0,0.0,0.0,0.0,0.0,1.60206,0.0,0.0,1.60206
3,6,0.0,0.0,0.0,0.0,0.0,0.0,1.477121,0.0,0.0,1.477121
4,12,1.30103,0.0,0.0,0.0,0.0,1.60206,1.0,1.30103,1.60206,6.80618


In [137]:
#Sorting dataframe according to tf_sum and getting relevant documents
temp_doc.sort_values(by= 'tf_sum', ascending=False).head()['docid'].values

array([ 51,  12, 184, 364, 572])

In [138]:
def tf_rank(qtokens):
  # Getting a list of unique query terms which are present in vocabulary
  qtokens=list(set(qtokens).intersection(vocabulary))
 
  # Creating list of columns to retrieve
  columns=['docid']
  columns.extend(qtokens)

  # Retrieving term frequencies for query terms
  temp_doc=documents_tf.loc[:,columns].copy()

  # Adding all the frequencies
  temp_doc['tf_sum']=temp_doc[qtokens].sum(axis=1)

  # Sorting dataframe according to sum of TF and getting relevant docs
  rel_docs=temp_doc.sort_values(by='tf_sum',ascending=False).head()['docid'].values

  return rel_docs

In [139]:
#Ranking documents according to term frequency
queries['tf_rel']= queries['tokens'].apply(lambda x: tf_rank(x))
queries.head()

Unnamed: 0,qid,query,tokens,jaccard_rel,ground_truth,jaccard_ap,tf_rel
0,1,what similarity laws must be obeyed when const...,"[similarity, law, obey, construct, aeroelastic...","[12, 51, 378, 670, 875]","[184, 29, 31, 57, 378]",0.333333,"[51, 12, 184, 364, 572]"
1,2,what are the structural and aeroelastic proble...,"[structural, aeroelastic, problem, associate, ...","[12, 51, 700, 746, 875]","[12, 746, 15, 184, 858]",0.75,"[12, 172, 51, 746, 798]"
2,3,what problems of heat conduction in composite ...,"[problem, heat, conduction, composite, slab, s...","[5, 584, 6, 145, 582]","[5, 6, 90, 91, 119]",0.833333,"[5, 980, 584, 91, 395]"
3,8,what methods -dash exact or approximate -dash ...,"[method, dash, exact, approximate, dash, prese...","[122, 1306, 639, 655, 988]","[48, 122, 354, 360, 1005]",1.0,"[122, 234, 1104, 924, 1307]"
4,10,are real-gas transport properties for air avai...,"[real, gas, transport, property, air, availabl...","[405, 302, 436, 583, 616]","[259, 405, 302, 436, 437]",1.0,"[302, 185, 616, 1009, 1313]"


In [140]:
#Evaluation on Train Set
queries['tf_ap']= queries.apply(lambda x: average_precision(x['tf_rel'], x['ground_truth']), axis=1)
queries.head()

Unnamed: 0,qid,query,tokens,jaccard_rel,ground_truth,jaccard_ap,tf_rel,tf_ap
0,1,what similarity laws must be obeyed when const...,"[similarity, law, obey, construct, aeroelastic...","[12, 51, 378, 670, 875]","[184, 29, 31, 57, 378]",0.333333,"[51, 12, 184, 364, 572]",0.333333
1,2,what are the structural and aeroelastic proble...,"[structural, aeroelastic, problem, associate, ...","[12, 51, 700, 746, 875]","[12, 746, 15, 184, 858]",0.75,"[12, 172, 51, 746, 798]",0.75
2,3,what problems of heat conduction in composite ...,"[problem, heat, conduction, composite, slab, s...","[5, 584, 6, 145, 582]","[5, 6, 90, 91, 119]",0.833333,"[5, 980, 584, 91, 395]",0.75
3,8,what methods -dash exact or approximate -dash ...,"[method, dash, exact, approximate, dash, prese...","[122, 1306, 639, 655, 988]","[48, 122, 354, 360, 1005]",1.0,"[122, 234, 1104, 924, 1307]",1.0
4,10,are real-gas transport properties for air avai...,"[real, gas, transport, property, air, availabl...","[405, 302, 436, 583, 616]","[259, 405, 302, 436, 437]",1.0,"[302, 185, 616, 1009, 1313]",1.0


In [141]:
#Finding Mean average Precision
print('Mean Average Precision :', queries['tf_ap'].mean())

Mean Average Precision : 0.5897058823529412


In [142]:
# Ranking documents according to term frequency
queries_val['tf_rel']=queries_val['tokens'].apply(lambda x: tf_rank(x))

# Finding average precision for each query
queries_val['tf_ap']=queries_val.apply(lambda x: average_precision(x['tf_rel'],x['ground_truth']),axis=1)

# Finding Mean Average Precision
print('Mean Average Precision on Validation Set :',queries_val['tf_ap'].mean())

Mean Average Precision on Validation Set : 0.44368686868686863


#### Inverse document frequency

In [143]:
# Term frequency of documents
documents_tf.head()

Unnamed: 0,docid,ab,abbreviate,ability,ablate,ablating,ablation,able,abrupt,abruptly,...,year,yield,york,young,z,zbrozek,zero,zeroth,zone,zuk
0,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [149]:
# Initializing a dictionary to store IDF values
idf_dict= dict.fromkeys(vocabulary, 0)

#Count of non-zero values per column (Document frequency)
non_zero_count= np.count_nonzero(documents_tf.iloc[:,1:],axis=0)

#Assigning IDF values
for term, document_frequency in zip(list(vocabulary), non_zero_count):
    idf_dict[term]= np.log10(documents.shape[0]/(document_frequency))
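
The formula used above is IDF = log10(N / df), where N is the number of documents and df is the number of documents containing the term. The sketch below evaluates it for a few document frequencies with N = 387; the function name `idf` is hypothetical, and the standard library is used in place of NumPy.

```python
import math

n_docs = 387  # number of documents in this subset

def idf(document_frequency, n_docs=n_docs):
    # Same formula as the cell above: log10(N / df).
    return math.log10(n_docs / document_frequency)

for df in (1, 2, 4, 387):
    print(df, round(idf(df), 4))
```

Because the vocabulary is built from the documents themselves, every term has a document frequency of at least 1, so the division is safe. A term that appears in every document gets an IDF of 0.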

In [150]:
idf_dict

{'ab': 2.28668096935493,
 'abbreviate': 2.5877109650189114,
 'ability': 2.5877109650189114,
 'ablate': 2.28668096935493,
 'ablating': 2.5877109650189114,
 'ablation': 1.9856509736909491,
 'able': 2.28668096935493,
 'abrupt': 2.5877109650189114,
 'abruptly': 2.28668096935493,
 'absence': 2.28668096935493,
 'absolute': 1.8887409606828927,
 'absorb': 2.5877109650189114,
 'absorption': 2.5877109650189114,
 'academic': 2.5877109650189114,
 'accelerate': 2.28668096935493,
 'accelerated': 2.28668096935493,
 'acceleration': 2.110589710299249,
 'accept': 2.110589710299249,
 'acceptability': 2.5877109650189114,
 'acceptance': 2.5877109650189114,
 'accessible': 2.5877109650189114,
 'accidental': 2.5877109650189114,
 'accommodate': 2.5877109650189114,
 'accommodation': 2.5877109650189114,
 'accompany': 2.28668096935493,
 'accompanying': 2.5877109650189114,
 'accomplish': 2.28668096935493,
 'accord': 1.7426129250046545,
 'accordance': 2.5877109650189114,
 'accordingly': 1.9856509736909491,
 'accoun

#### Ranking

In [151]:
#create a temporary dataframe
temp_doc= documents[['docid', 'tokens']].copy()

In [152]:
# Function for getting sum of IDF values for a query-document pair
def idf_sum(dtokens, qtokens):
    #getting common terms in query and document
    common_term= set(dtokens).intersection(set(qtokens))
    
    #Getting IDF values for common terms
    idf_list=[value for key, value in idf_dict.items() if key in common_term]
    
    return sum(idf_list)

In [153]:
def idf_rank(qtokens):
  # Getting sum of IDF values for all the query-document pairs
  temp_doc['idf_sum']=temp_doc['tokens'].apply(lambda x: idf_sum(x,qtokens))

  # Sorting dataframe according to sum of IDF and getting relevant docs
  rel_docs=temp_doc.sort_values(by='idf_sum',ascending=False).head()['docid'].values

  return rel_docs

In [154]:
# Ranking documents according to inverse document frequency
queries['idf_rel']=queries['tokens'].apply(lambda x: idf_rank(x))
queries.head()

Unnamed: 0,qid,query,tokens,jaccard_rel,ground_truth,jaccard_ap,tf_rel,tf_ap,idf_rel
0,1,what similarity laws must be obeyed when const...,"[similarity, law, obey, construct, aeroelastic...","[12, 51, 378, 670, 875]","[184, 29, 31, 57, 378]",0.333333,"[51, 12, 184, 364, 572]",0.333333,"[184, 51, 12, 625, 332]"
1,2,what are the structural and aeroelastic proble...,"[structural, aeroelastic, problem, associate, ...","[12, 51, 700, 746, 875]","[12, 746, 15, 184, 858]",0.75,"[12, 172, 51, 746, 798]",0.75,"[12, 172, 51, 746, 364]"
2,3,what problems of heat conduction in composite ...,"[problem, heat, conduction, composite, slab, s...","[5, 584, 6, 145, 582]","[5, 6, 90, 91, 119]",0.833333,"[5, 980, 584, 91, 395]",0.75,"[5, 91, 625, 584, 90]"
3,8,what methods -dash exact or approximate -dash ...,"[method, dash, exact, approximate, dash, prese...","[122, 1306, 639, 655, 988]","[48, 122, 354, 360, 1005]",1.0,"[122, 234, 1104, 924, 1307]",1.0,"[122, 556, 1104, 234, 924]"
4,10,are real-gas transport properties for air avai...,"[real, gas, transport, property, air, availabl...","[405, 302, 436, 583, 616]","[259, 405, 302, 436, 437]",1.0,"[302, 185, 616, 1009, 1313]",1.0,"[302, 332, 405, 1009, 583]"


In [155]:
#Evaluation on train set
# Finding average precision for each query
queries['idf_ap']=queries.apply(lambda x: average_precision(x['idf_rel'],x['ground_truth']),axis=1)

# Finding Mean Average Precision
print('Mean Average Precision=>',queries['idf_ap'].mean())

Mean Average Precision=> 0.64281045751634
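The `average_precision` helper is defined earlier in the notebook and is not shown in this section. As a rough guide to what the metric measures, the sketch below averages the precision at each rank where a relevant document appears. This is an assumption about the helper rather than its actual code; `average_precision_sketch` is a hypothetical name, and returning 0 when nothing relevant is retrieved is also an assumption. It does reproduce the `jaccard_ap` values of the first two rows in the table above.

```python
def average_precision_sketch(retrieved, relevant):
    # Precision is recorded at every rank that holds a relevant document
    relevant = set(relevant)
    hits = 0
    precisions = []
    for rank, docid in enumerate(retrieved, start=1):
        if docid in relevant:
            hits += 1
            precisions.append(hits / rank)
    # Assumed convention: no relevant document retrieved gives a score of 0
    return sum(precisions) / len(precisions) if precisions else 0.0

# First two rows of the table above (jaccard_rel against ground_truth)
ap_q1 = average_precision_sketch([12, 51, 378, 670, 875], [184, 29, 31, 57, 378])
ap_q2 = average_precision_sketch([12, 51, 700, 746, 875], [12, 746, 15, 184, 858])
print(ap_q1, ap_q2)
```

For query 1 the only hit is at rank 3, giving 1/3; for query 2 the hits at ranks 1 and 4 give (1 + 2/4) / 2 = 0.75. Mean Average Precision is then the mean of these per-query values.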


In [156]:
#Evaluation on validation set

# Ranking documents according to inverse document frequency
queries_val['idf_rel']=queries_val['tokens'].apply(lambda x: idf_rank(x))

# Finding average precision for each query
queries_val['idf_ap']=queries_val.apply(lambda x: average_precision(x['idf_rel'],x['ground_truth']),axis=1)

# Finding Mean Average Precision
print('Mean Average Precision on Validation Set=>',queries_val['idf_ap'].mean())

Mean Average Precision on Validation Set=> 0.37626262626262624


### TF-IDF

In [157]:
#Calculating TF-IDF
documents_tfidf= documents_tf.iloc[:,1:]* list(idf_dict.values())
documents_tfidf.head()

Unnamed: 0,ab,abbreviate,ability,ablate,ablating,ablation,able,abrupt,abruptly,absence,...,year,yield,york,young,z,zbrozek,zero,zeroth,zone,zuk
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
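Multiplying by `list(idf_dict.values())` pairs each column with an IDF value purely by position, so the result is only correct when the keys of `idf_dict` are in the same order as the term columns. A possible safeguard, sketched here on toy data, is to multiply by a `pd.Series` so that pandas aligns on the term labels instead. The names `toy_tf` and `toy_idf` and all values are hypothetical.

```python
import pandas as pd

# Toy term-frequency table (hypothetical values); columns are vocabulary terms
toy_tf = pd.DataFrame({"flow": [1.0, 0.0], "heat": [0.0, 2.0], "plate": [1.0, 1.0]})

# Toy IDF values, deliberately stored in a different key order than the columns
toy_idf = {"plate": 0.5, "flow": 0.25, "heat": 2.0}

# Multiplying by a Series aligns on column labels rather than on position
toy_tfidf = toy_tf.mul(pd.Series(toy_idf), axis=1)

print(toy_tfidf)
```

With label alignment, "flow" is scaled by 0.25 and "heat" by 2.0 regardless of the dictionary's key order.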


In [158]:
# Adding a column of docids
documents_tfidf= pd.concat([documents[['docid']],documents_tfidf], axis=1)
print('Shape :', documents_tfidf.shape)
documents_tfidf.head()

Shape : (387, 3043)


Unnamed: 0,docid,ab,abbreviate,ability,ablate,ablating,ablation,able,abrupt,abruptly,...,year,yield,york,young,z,zbrozek,zero,zeroth,zone,zuk
0,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [159]:
#Getting TF-IDF values for the terms in the first document
print("Document ID: ", documents_tfidf['docid'][0])
documents_tfidf.loc[0,documents_tfidf.loc[0,:]!=0][1:].to_dict()

Document ID:  2


{'approximation': 0.9249531333373373,
 'arise': 1.4415829293406734,
 'body': 0.8179179714411188,
 'boundary': 0.7825560609575196,
 'classical': 1.5463182798606863,
 'consequently': 2.110589710299249,
 'consider': 1.0167952618775644,
 'constant': 0.8887409606828927,
 'curved': 1.6334684555795864,
 'different': 1.1405529336766922,
 'dimensional': 1.0532614720216564,
 'discuss': 0.7243881048984555,
 'discussion': 1.2074997233073055,
 'edge': 0.7488618742816561,
 'effect': 0.4323749275538496,
 'emit': 2.5877109650189114,
 'exist': 1.0314084642516241,
 'feature': 1.3835909823629866,
 'ferri': 2.28668096935493,
 'flat': 1.1061597913506378,
 'flow': 0.5293135804637596,
 'fluid': 1.3741895767992272,
 'free': 1.1444700154382332,
 'high': 0.5964848893264165,
 'hypersonic': 1.1024255232317401,
 'incompressible': 1.311343739741794,
 'instance': 1.8887409606828927,
 'investigate': 1.0195092409519164,
 'inviscid': 1.8692846438486488,
 'irrotational': 1.9856509736909491,
 'layer': 0.8049169630236986,

In [160]:
#Ranking

def tf_idf_rank(qtokens):
  # Getting a list of unique query terms which are present in vocabulary
  qtokens=list(set(qtokens).intersection(vocabulary))

  # Creating list of columns to retrieve
  columns=['docid']
  columns.extend(qtokens)

  # Retrieving TF-IDF for query terms
  temp_doc=documents_tfidf.loc[:,columns].copy()

  # Summing the TF-IDF values of the query terms for each document
  temp_doc['tfidf_sum']=temp_doc[qtokens].sum(axis=1)

  # Sorting dataframe according to sum of TF-IDF and getting relevant docs
  rel_docs=temp_doc.sort_values(by='tfidf_sum',ascending=False).head()['docid'].values

  return rel_docs

In [161]:
queries['tf_idf_rel']=queries['tokens'].apply(lambda x: tf_idf_rank(x))
queries.head()

Unnamed: 0,qid,query,tokens,jaccard_rel,ground_truth,jaccard_ap,tf_rel,tf_ap,idf_rel,idf_ap,tf_idf_rel
0,1,what similarity laws must be obeyed when const...,"[similarity, law, obey, construct, aeroelastic...","[12, 51, 378, 670, 875]","[184, 29, 31, 57, 378]",0.333333,"[51, 12, 184, 364, 572]",0.333333,"[184, 51, 12, 625, 332]",1.0,"[51, 184, 12, 332, 625]"
1,2,what are the structural and aeroelastic proble...,"[structural, aeroelastic, problem, associate, ...","[12, 51, 700, 746, 875]","[12, 746, 15, 184, 858]",0.75,"[12, 172, 51, 746, 798]",0.75,"[12, 172, 51, 746, 364]",0.75,"[12, 51, 172, 746, 724]"
2,3,what problems of heat conduction in composite ...,"[problem, heat, conduction, composite, slab, s...","[5, 584, 6, 145, 582]","[5, 6, 90, 91, 119]",0.833333,"[5, 980, 584, 91, 395]",0.75,"[5, 91, 625, 584, 90]",0.866667,"[5, 91, 90, 584, 625]"
3,8,what methods -dash exact or approximate -dash ...,"[method, dash, exact, approximate, dash, prese...","[122, 1306, 639, 655, 988]","[48, 122, 354, 360, 1005]",1.0,"[122, 234, 1104, 924, 1307]",1.0,"[122, 556, 1104, 234, 924]",1.0,"[122, 234, 1104, 556, 1307]"
4,10,are real-gas transport properties for air avai...,"[real, gas, transport, property, air, availabl...","[405, 302, 436, 583, 616]","[259, 405, 302, 436, 437]",1.0,"[302, 185, 616, 1009, 1313]",1.0,"[302, 332, 405, 1009, 583]",0.833333,"[302, 1009, 185, 583, 332]"


In [164]:
#Evaluation on training set

# Finding average precision for each query
queries['tfidf_ap']=queries.apply(lambda x: average_precision(x['tf_idf_rel'],x['ground_truth']),axis=1)

# Finding Mean Average Precision
print('Mean Average Precision=>',queries['tfidf_ap'].mean())

Mean Average Precision=> 0.659640522875817


In [165]:
#Evaluation on validation set

# Ranking documents according to TF-IDF
queries_val['tfidf_rel']=queries_val['tokens'].apply(lambda x: tf_idf_rank(x))

# Finding average precision for each query
queries_val['tfidf_ap']=queries_val.apply(lambda x: average_precision(x['tfidf_rel'],x['ground_truth']),axis=1)

# Finding Mean Average Precision
print('Mean Average Precision on Validation Set=>',queries_val['tfidf_ap'].mean())

Mean Average Precision on Validation Set=> 0.42941919191919187


### TF-IDF based vector space model

In [166]:
def gen_tfidf_queries(queries_data):
  tf_list_queries=[]

  # Getting Term frequencies
  for tokens in queries_data['tokens']:
    # Initializing a dictionary with 0 frequency
    queries_dict=dict.fromkeys(vocabulary,0)      
    # Marking each in-vocabulary query term once (set() drops repeats, so every count is 0 or 1)
    for term in set(tokens).intersection(vocabulary):
      queries_dict[term]+=1
    # Adding dictionary to list
    tf_list_queries.append(queries_dict)

  # Creating a dataframe of term frequencies for queries
  queries_tf=pd.DataFrame(tf_list_queries)

  # Log Normalizing the term counts for queries
  # (DataFrame.applymap is deprecated since pandas 2.1; DataFrame.map is its replacement there)
  queries_tf=queries_tf.applymap(log_normalize)

  # Calculating TF-IDF for queries
  queries_tfidf=queries_tf*list(idf_dict.values())

  # Adding a column of qids
  queries_tfidf=pd.concat([queries_data['qid'],queries_tfidf],axis=1)

  return queries_tfidf
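Because the loop above iterates over `set(tokens)`, a term that occurs several times in a query still receives a count of 1. If repeated query terms should carry more weight, the counts could be collected with `collections.Counter` first. The sketch below illustrates this under stated assumptions: `log_tf` stands in for the notebook's `log_normalize` (defined outside this section) and assumes the weighting 1 + log10(count); `query_tf_weights`, `toy_vocab` and `toy_tokens` are hypothetical names.

```python
import math
from collections import Counter

def log_tf(count):
    # Assumed weighting: 1 + log10(count) for positive counts, otherwise 0
    return 1 + math.log10(count) if count > 0 else 0.0

def query_tf_weights(tokens, vocab):
    # Count every occurrence of each in-vocabulary term, then log-normalise
    counts = Counter(term for term in tokens if term in vocab)
    return {term: log_tf(count) for term, count in counts.items()}

toy_vocab = {"method", "exact", "approximate", "flow"}
toy_tokens = ["method", "dash", "exact", "approximate", "dash", "method"]

weights = query_tf_weights(toy_tokens, toy_vocab)
print(weights)
```

Here "method" occurs twice and gets a weight of 1 + log10(2), while "exact" and "approximate" get 1. Switching the notebook to this scheme would change the query vectors, so the reported MAP values would no longer apply.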

In [167]:
# Creating TF-IDF vectors for queries in train set
queries_tfidf=gen_tfidf_queries(queries)
print('Shape :',queries_tfidf.shape)
queries_tfidf.head()

Shape : (85, 3043)


Unnamed: 0,qid,ab,abbreviate,ability,ablate,ablating,ablation,able,abrupt,abruptly,...,year,yield,york,young,z,zbrozek,zero,zeroth,zone,zuk
0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [168]:
# Creating TF-IDF vectors for queries in validation set
queries_val_tfidf=gen_tfidf_queries(queries_val)
print('Shape :',queries_val_tfidf.shape)
queries_val_tfidf.head()

Shape : (22, 3043)


Unnamed: 0,qid,ab,abbreviate,ability,ablate,ablating,ablation,able,abrupt,abruptly,...,year,yield,york,young,z,zbrozek,zero,zeroth,zone,zuk
0,189,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,190,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,191,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,194,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,196,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [169]:
# Ranking

#temporary dataframe with tfidf values
temp_doc_tfidf= documents_tfidf.copy()

In [170]:
from sklearn.metrics.pairwise import cosine_similarity

In [171]:
queries_tfidf.iloc[0,1:].values

array([0., 0., 0., ..., 0., 0., 0.])

In [173]:
temp_doc_tfidf.iloc[0,1:].values

array([0., 0., 0., ..., 0., 0., 0.])

In [174]:
#cosine_similarity expects 2-D arrays, so each vector is reshaped to shape (1, n_terms)

cosine_similarity(queries_tfidf.iloc[0,1:].values.reshape(1,-1),temp_doc_tfidf.iloc[0,1:].values.reshape(1,-1) )

array([[0.01570257]])

In [175]:
cosine_similarity(queries_tfidf.iloc[0,1:].values.reshape(1,-1),temp_doc_tfidf.iloc[0,1:].values.reshape(1,-1) ).item()

0.015702570044666825

In [176]:
def tfidf_vsm_rank(query_row):
  # query_row is one row of a query TF-IDF dataframe: qid first, then one value per vocabulary term
  query_vec=query_row[1:].values.reshape(1, -1)

  # Finding cosine similarity score for every document vector against the query vector
  temp_doc_tfidf['tfidf_vsm']=temp_doc_tfidf.apply(lambda x: cosine_similarity(x.values[1:].reshape(1, -1),query_vec).item(),axis=1)

  # Sorting dataframe by cosine similarity score and getting the top 5 docs
  rel_docs=temp_doc_tfidf.sort_values(by='tfidf_vsm',ascending=False).head()['docid'].values

  # Dropping similarity column so the next query starts from a clean dataframe
  temp_doc_tfidf.drop(columns='tfidf_vsm',inplace=True)

  return rel_docs
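The function above calls `cosine_similarity` once per document row, which can be slow for a larger collection. One alternative, sketched below with NumPy on toy vectors, scores every document against a query with a single matrix product. The names `cosine_scores`, `toy_docs`, `toy_docids` and `toy_query` are hypothetical, and giving zero-norm vectors a score of 0 is an assumption. scikit-learn's `cosine_similarity` also accepts two 2-D matrices and returns all pairwise scores at once.

```python
import numpy as np

def cosine_scores(query_vec, doc_matrix):
    # Cosine similarity between one query vector and every row of doc_matrix
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    dots = doc_matrix @ query_vec
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    # Assumed convention: a zero-norm vector gets a similarity of 0
    scores = np.zeros(len(doc_matrix))
    np.divide(dots, norms, out=scores, where=norms > 0)
    return scores

# Toy TF-IDF vectors over a three-term vocabulary (hypothetical values)
toy_docs = np.array([[1.0, 0.0, 0.0],
                     [1.0, 1.0, 0.0],
                     [0.0, 0.0, 2.0]])
toy_docids = np.array([2, 3, 5])
toy_query = np.array([1.0, 1.0, 0.0])

scores = cosine_scores(toy_query, toy_docs)

# IDs of the two highest-scoring documents, best first
top = toy_docids[np.argsort(-scores, kind="stable")][:2]
print(scores, top)
```

The second toy document points in the same direction as the query and scores 1, the first shares one of two terms and scores 1/√2, and the third shares none and scores 0.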

In [178]:
queries['tfidf_vsm_rel']= queries_tfidf.apply(lambda x: tfidf_vsm_rank(x), axis=1)

In [183]:
# Evaluation on train set

# Finding average precision for each query
queries['tfidf_vsm_ap']=queries.apply(lambda x: average_precision(x['tfidf_vsm_rel'],x['ground_truth']),axis=1)

# Finding Mean Average Precision
print('Mean Average Precision on train set :',queries['tfidf_vsm_ap'].mean())

Mean Average Precision on train set : 0.6595261437908496


In [184]:
# Ranking documents according to Cosine Similarity
queries_val['tfidf_vsm_rel']=queries_val_tfidf.apply(lambda x: tfidf_vsm_rank(x),axis=1)

# Finding average precision for each query
queries_val['tfidf_vsm_ap']=queries_val.apply(lambda x: average_precision(x['tfidf_vsm_rel'],x['ground_truth']),axis=1)

# Finding Mean Average Precision
print('Mean Average Precision on validation set :',queries_val['tfidf_vsm_ap'].mean())

Mean Average Precision on validation set : 0.3535353535353535
