<div style = "font-family: Helvetica Neue;; font-size:260%;"> 
<b>Document Turbo <span style = "color:#4285F4">F</span><span style = "color:#DB4437">e</span><span style = "color:#F4B400">t</span><span style = "color:#0F9D58">c</span><span style = "color:#4285F4">h</span><span style = "color:#DB4437">e</span><span style = "color:#F4B400">r</span><span style = "color:#0F9D58">z</span></b>

</div>

<div style = "font-family: Helvetica Neue; font-size:110%"> 
Arjuna Beuger <br>
Kato Schmidt <br>
Sijf Schermerhorn <br>
Kim Tigchelaar <br><br>

</div>

<div class="alert alert-warning", style = "font-family: Helvetica Neue; font-size:110%"> 
    💡 <b>First things first:</b> importing all of the needed libraries.
</div>

In [1]:
import os
import codecs
import json
import ujson
import rapidjson
import argparse
import string
from collections import defaultdict, Counter
from zipfile import ZipFile
import tqdm
import tqdm.notebook  # makes tqdm.notebook.tqdm available
import math

import numpy as np

import torch.utils.data as data
import torch
import torch.nn as nn
import torch.optim as optim

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import regexp_tokenize
from nltk.stem.porter import *

import sklearn
import statistics

In [2]:
%load_ext memory_profiler

<div id = "data", style = "font-family: Helvetica Neue;; font-size:190%"> <span style = "color:#4285F4"><b>1 Dataloader</b></span>
</div>

<div class="alert alert-warning", style = "font-family: Helvetica Neue; font-size:110%"> 
We load in the data using the provided functions, which all have <i>path</i> as only argument. This path always needs to be equivalent to the path of the corresponding datafile in our folder system.
</div>

In [3]:
def passage_loader(path):
    print("Load passages from: {}".format(path))
    # The context manager closes the file as soon as it has been parsed
    with open(path, 'r', encoding="utf-8", errors="ignore") as f:
        passages = ujson.load(f)
    return passages

def query_loader(path):
    print("Load queries from: {}".format(path))
    with open(path, 'r') as f:
        queries = ujson.load(f)
    return queries

def label_loader(path):
    print("Load labels from: {}".format(path))
    with open(path, 'r') as f:
        labels = ujson.load(f)
    return labels

def index_loader(path):
    print("Load index from: {}".format(path))
    with open(path, 'r', encoding="utf-8", errors="ignore") as f:
        index = ujson.load(f)
    return index

def postings_loader(path):
    with open(path, 'r', encoding="utf-8", errors="ignore") as f:
        postings = ujson.load(f)
    return postings


# %memit passages = passage_loader("data/passages_small.json")

# %memit queries_training = query_loader("data/training_queries.json")
# %memit queries_validation = query_loader("data/validation_queries.json")
# %memit queries_test = query_loader("data/test_queries.json")

%memit labels_training = label_loader("data/training_labels.json")
%memit labels_validation = label_loader("data/validation_labels.json")

Load labels from: data/training_labels.json
Load labels from: data/training_labels.json
Load labels from: data/training_labels.json
peak memory: 226.35 MiB, increment: 3.91 MiB
Load labels from: data/validation_labels.json
Load labels from: data/validation_labels.json
Load labels from: data/validation_labels.json
peak memory: 228.25 MiB, increment: -3.02 MiB


<div style = "font-family: Helvetica Neue;; font-size:190%"> 
<span style = "color:#DB4437"><b>2 Pre-processing</b></span>
</div>

<div style = "font-family: Helvetica Neue;; font-size:110%"> 
In this preprocessing pipeline we take the data collections and apply several techniques that reduce the number of tokens, removing as much redundant data as possible to increase the discriminatory power of the search engine. We preprocess the data using the following techniques: tokenization, stemming, stopword removal and removal of all characters that are not in ASCII. <br>
    
Characters that are not in the ASCII set are filtered out of the tokens. ASCII, the American Standard Code for Information Interchange, is a very common character encoding format for text data in computers and on the internet. The ASCII table represents 128 characters as numbers, with each character assigned a number from 0 to 127. When pre-processing the data, characters are first encoded to their corresponding ASCII number. If a character cannot be encoded, it is not in ASCII and is therefore ignored. Lastly, the remaining characters are decoded back to text, which means only ASCII characters are kept. <br>

By choosing to filter on ASCII characters instead of keeping the full UTF-8 text, we made a trade-off between relevance and information gain. When testing our code with UTF-8, irrelevant (or non-processable) characters such as emojis weren't filtered out, which hurt the accuracy of our model. Additionally, UTF-8 can encode every Unicode character, a space of over one million code points, whereas ASCII only contains 128 (see the Venn diagram below for a visualization). Since limited memory is a big challenge during this project, ASCII was the most rational choice.
</div>
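
As a small illustration of the encode/decode round trip described above, the sketch below strips non-ASCII characters from a few made-up tokens. The helper name `strip_non_ascii` and the sample tokens are assumptions for illustration only.

```python
def strip_non_ascii(token: str) -> str:
    # Encode to ASCII, silently dropping characters that cannot be encoded,
    # then decode the remaining bytes back to a string.
    return token.encode("ascii", "ignore").decode("ascii")

samples = ["café", "naïve", "data", "😀"]
cleaned = [strip_non_ascii(t) for t in samples]
print(cleaned)  # ['caf', 'nave', 'data', '']
```

Note that a token consisting only of non-ASCII characters becomes an empty string, so later steps have to skip empty tokens.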

<img src="venn.png" alt="Drawing" style="width: 400px;"/>

<div class="alert alert-info", style = "font-family: Helvetica Neue; font-size:110%"> 
Both the passages and the queries are preprocessed, using the same techniques. This not only makes the code more efficient, but also reduces the <i>vocabulary mismatch</i> problem, because both the passages and the queries are reduced to their most "compact" form of only relevant tokens. 
</div>


<div style = "font-family: Helvetica Neue;; font-size:160%"> 
<span style = "color:#DB4437"><b>Word processing</b></span>
</div>

<div style = "font-family: Helvetica Neue; font-size:110%"> 
    
<div style = "font-family: Helvetica Neue; font-size:120%"> 
<span style = "color:#DB4437"><b>2.1 Tokenizing</b></span>
</div>
When tokenizing, we use the Treebank tokenizer from the nltk module. This tokenizer performs the following steps: <br>
    
    
<ul>
 <li>Standard contractions are split into two separate tokens (since they are two separate words in the first place). For example, <i>don't</i> is turned into <i>do</i> and <i>n't</i>, since it is a contraction of two separate words. </li>
 <li>Most punctuation characters are treated as separate tokens.</li>
 <li>Commas and single quotes are split off when they are followed by whitespace.</li>
 <li>Periods that appear at the end of the line are separated. </li>
</ul>
<br>

<div style = "font-family: Helvetica Neue; font-size:120%"> 
<span style = "color:#DB4437"><b>2.2 Stopwords</b></span>
</div>
To reduce the number of tokens, all tokens that appear in nltk's list of English stopwords are removed.<br><br>
    
<div style = "font-family: Helvetica Neue; font-size:120%"> 
<span style = "color:#DB4437"><b>2.3 Stemming</b></span>
</div>
Stemming reduces the tokens to their root form. For example, <i>eating</i> is turned into <i>eat</i>. We use the <i>PorterStemmer()</i>. According to Croft et al. (2015), this is the most popular algorithmic stemmer and has been in use since the 1970s.
    
> "The stemmer consists of a number
of steps, each containing a set of rules for removing suffixes. At each step, the rule
for the longest applicable suffix is executed." (Croft et al., 2015)
    
<div style = "font-family: Helvetica Neue; font-size:120%"> 
<span style = "color:#DB4437"><b>2.4 Removing non-ASCII characters</b></span>
</div>
Strips every character that is not in the ASCII standard set from the tokens.

</div>

In [4]:
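# Build the tokenizer, stemmer and stopword set once instead of once per token or call
TOKENIZER = nltk.TreebankWordTokenizer()
STEMMER = PorterStemmer()
STOPWORDS = set(nltk.corpus.stopwords.words("english"))

def tokenize(text: str):
    tokens = [t for s in nltk.sent_tokenize(text) for t in TOKENIZER.tokenize(s)]
    # Drop tokens that consist of punctuation only
    tokens = [t for t in tokens if not all(c in string.punctuation for c in t)]
    return tokens

def stemming(tokens):
    return [STEMMER.stem(i) for i in tokens]

def stopping(tokens):
    # Remove stopwords; "/" is replaced because tokens are later used as file names
    return [i.replace("/","-") for i in tokens if i not in STOPWORDS]

def remove_non_ascii(tokens):
    # Characters that cannot be encoded as ASCII are dropped
    return [i.encode("ascii", "ignore").decode() for i in tokens]

def preprocess(text: str):    
    tokens = tokenize(text)
    # Note: stopwords are removed after stemming, so a stopword that the
    # stemmer has altered no longer matches the stopword list
    tokens = stemming(tokens)
    tokens = stopping(tokens)
    
    tokens = remove_non_ascii(tokens)
    return tokens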

def process_passages(passages):
    passages_tokenised = {}
    for passage_id in tqdm.notebook.tqdm(passages.keys()):
        passages_tokenised[passage_id] = preprocess(passages[passage_id])
    return passages_tokenised

def process_queries(queries):
    queries_tokenised = {}  
    for query_id in queries.keys():
        queries_tokenised[query_id] = preprocess(queries[query_id])
    return queries_tokenised  


# %memit tokenised_queries_training = process_queries(queries_training)
# %memit tokenised_queries_validation = process_queries(queries_validation)
# %memit tokenised_queries_test = process_queries(queries_test)
# %memit tokenised_passages = process_passages(passages)

In [5]:
# %memit tokenised_passages = passage_loader("data/small_tokenised_passages.json")

<div class="alert alert-success", style = "font-family: Helvetica Neue; font-size:110%"> 
    As overall output of the pre-processing, the dictionary <b>tokenised_passages</b> is returned. This dict is based on the <i>passages_small.json</i> file. The output dict is structured as follows:
</div>

```
{"pid_123456" :
    ["file",
     "browser",
     "appear",
     ...
    ]
}
```


<div style = "font-family: Helvetica Neue;; font-size:190%"> 
<span style = "color:#F4B400"><b>3 Building the index</b></span>
</div>

<div style = "font-family: Helvetica Neue; font-size:110%"> 
After pre-processing, the data is ready to be used. The first step in making our search engine is to build both an <i>inverted</i> index and a <i>language model</i> index. The challenge here is that our working memory is limited, so we need smart and efficient ways to build our indexes. To do this, we create a separate file per token on disk beforehand. The folders are structured as follows:
    
> For every character in ASCII that a token starts with, there is a corresponding __folder__ with the same name.
    >> Every folder contains a __collection of json files__, one for each token that starts with the character in the folder name.
    
For example, if we create the index information for the token "book", a json file named __book.json__ is created containing the corresponding properties, and it is placed in folder __b__.
</div>
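
The mapping from a token to its file can be sketched as follows. The helper name `token_path` is hypothetical and not part of our pipeline; the root folder is only an example.

```python
def token_path(root: str, token: str) -> str:
    # Hypothetical helper: the folder is the token's first character,
    # the file is named after the full token.
    if not token:
        raise ValueError("empty tokens have no folder")
    return f"{root}/{token[0]}/{token}.json"

print(token_path("small_index/small_tf", "book"))
# small_index/small_tf/b/book.json
```

Because a lookup only needs the token itself to find its file, a query can be answered by opening one small file per query token instead of loading a complete index into memory.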

<div style = "font-family: Helvetica Neue;; font-size:160%"> 
<span style = "color:#F4B400"><b>3.1 Creating the files</b></span>
</div>

<div class="alert alert-warning", style = "font-family: Helvetica Neue; font-size:110%"> 
The function <b>create_files()</b> creates the needed folders beforehand. Which folders are needed depends on the chosen pre-processing techniques, since these determine the tokens.
</div>

In [6]:
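def create_files(tokenised_passages, path):
    # Collect the first character of every non-empty token
    first_chars = set()
    for tokens in tqdm.notebook.tqdm(tokenised_passages.values()):
        first_chars.update(token[0] for token in tokens if token)

    # Create one folder per first character
    for char in first_chars:
        try:
            os.makedirs(f'{path}/{char}', exist_ok=True)
        except (OSError, ValueError):
            # Characters that are not valid in a folder name are skipped
            pass
    return 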

<div style = "font-family: Helvetica Neue;; font-size:160%"> 
<span style = "color:#F4B400"><b>3.2 Creating the letters</b></span>
</div>

<div class="alert alert-warning", style = "font-family: Helvetica Neue; font-size:110%"> 
The function <b>letters()</b> collects the first character of every token. This list of characters is used later on to create the <i>language model</i> index.
</div>

In [7]:
def letters(tokenised_passages):
    letter = set()
    for i in tqdm.notebook.tqdm(tokenised_passages.values()):
        for j in i:
            if len(j) >0:
                letter.add(j[0])
    return list(letter)

<div style = "font-family: Helvetica Neue;; font-size:160%"> 
<span style = "color:#F4B400"><b>3.3 The TF-IDF index</b></span>
</div>

<div class="alert alert-warning", style = "font-family: Helvetica Neue; font-size:110%"> 
    The function <b>tf_index()</b> creates the <b>inverted index</b> based on the <i>term frequency</i> and writes the json files with the needed information for every token to disk, within the structure described above.
</div>
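
To make the posting format written by <b>tf_index()</b> concrete, the sketch below builds the same kind of posting list in memory for two made-up passages, without writing anything to disk. The passage ids and tokens are invented for illustration.

```python
from collections import Counter, defaultdict

toy_passages = {
    "pid_1": ["book", "brows", "book"],
    "pid_2": ["brows", "file"],
}

toy_postings = defaultdict(list)
for pid, passage in toy_passages.items():
    for term, tf in Counter(passage).items():
        if term.startswith("b"):  # only terms for the letter "b"
            toy_postings[term].append(
                {"pid": pid, "tf": tf, "length_document": len(passage)}
            )

print(toy_postings["book"])
# [{'pid': 'pid_1', 'tf': 2, 'length_document': 3}]
```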

In [8]:
def tf_index(letter, passages):
    # Collect the postings of every term that starts with `letter`
    postings = defaultdict(list)
    
    for pid, passage in tqdm.notebook.tqdm(passages.items()):
        counted = Counter(passage)

        for term, tf in counted.items():
                
            # Skip empty terms and very long terms
            if 0 < len(term) < 200 and term[0] == letter:
                postings[term].append({'pid':pid, 'tf':tf, 'length_document':len(passage)})

    # Write one json file per term
    for term, posting_list in postings.items():
        with open(f"small_index/small_tf/{letter}/{term}.json", 'w') as f:
            rapidjson.dump(posting_list, f)

    return 


<div class="alert alert-warning", style = "font-family: Helvetica Neue; font-size:110%"> 
    The function <b>tf_info()</b> creates a separate file containing the collection statistics that the term-frequency rankers need: the total number of documents and the average document length. Writing this file beforehand improves the efficiency of our code and ensures that our computations later on can be done as fast as possible.  
</div>

In [9]:
def tf_info(passages, path):
    # Named `info` so the dict does not shadow the function name
    info = dict()
    total_doc_length = 0
    
    for pid, passage in tqdm.notebook.tqdm(passages.items()):
        
        total_doc_length += len(passage)

    info["total_documents"] = len(passages)
    info['average_doc_length'] = total_doc_length/len(passages)
    with open(f"{path}/tf_info.json", 'w') as f:
        rapidjson.dump(info, f)

    return 

<div style = "font-family: Helvetica Neue;; font-size:160%"> 
<span style = "color:#F4B400"><b>3.4 The Language Model index</b></span>
</div>

<div class="alert alert-warning", style = "font-family: Helvetica Neue; font-size:110%"> 
    The function <b>lm_index()</b> creates the <b>inverted index</b> based on the <i>language model</i> and writes the json files with the needed information for every token to disk, within the structure described above.
</div>
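
The two quantities stored in the language model index are the relative frequency of a token in a passage and its relative frequency in the whole corpus. The sketch below computes both for an invented two-passage corpus; all names and values are illustrative assumptions.

```python
from collections import Counter

toy_corpus = {
    "pid_1": ["book", "brows", "book", "file"],
    "pid_2": ["brows", "file"],
}
toy_corpus_length = sum(len(p) for p in toy_corpus.values())  # 6 tokens

# P(token | passage): token count divided by the passage length
p_book_pid1 = Counter(toy_corpus["pid_1"])["book"] / len(toy_corpus["pid_1"])

# P(token | corpus): total token count divided by the corpus length
book_count = sum(Counter(p)["book"] for p in toy_corpus.values())
p_book_corpus = book_count / toy_corpus_length

print(p_book_pid1, round(p_book_corpus, 4))  # 0.5 0.3333
```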

In [10]:
def lm_index(letter, tokens_dict):
    
    index = defaultdict(lambda:{'postings':[], 'corpus_frequency': 0})
    corpus_length = 0

    # Iterate over all passages and their token lists
    for document, token_list in tqdm.notebook.tqdm(tokens_dict.items()):
        
        tf_counter = Counter(token_list)
        corpus_length += len(token_list)
      
        for token, tf in tf_counter.items():
            
            if 0 < len(token) < 200 and token[0] == letter:
                
                # Term frequency relative to the passage length: P(token | passage)
                posting = {'pid': document, 'term_frequency': tf/len(token_list)}
                index[token]['postings'].append(posting)
                index[token]['corpus_frequency'] += tf
                    
    # Turn the raw corpus counts into relative frequencies: P(token | corpus)
    for token in index:
        index[token]['corpus_frequency'] /= corpus_length
    
    for token in tqdm.notebook.tqdm(index.keys()):
        with open(f"small_index/small_lm/{letter}/{token}.json", 'w') as f:
            rapidjson.dump(index[token], f)

    return 


<div style = "font-family: Helvetica Neue; font-size:190%"> 
<span style = "color:#0F9D58"><b>4 Creating some meta data</b></span>
</div>

<div class="alert alert-warning", style = "font-family: Helvetica Neue; font-size:110%"> 
    The function <b>tf_meta()</b> writes a dictionary with meta data to disk. This meta data is used later in the pipeline, so it does not have to be recomputed for every calculation, which improves the efficiency of our code.
</div>

In [11]:
def tf_meta(passages):
    """
    Creates a dict with meta-data and writes it to tf_meta.json
    """
    
    # Named `meta` so the dict does not shadow the function name
    meta = dict()
    corpus_length = 0
    
    for pid, passage in tqdm.notebook.tqdm(passages.items()):
        
        corpus_length += len(passage)

    meta["total_documents"] = len(passages)
    meta['average_doc_length'] = corpus_length/len(passages)
    with open("tf_meta.json", 'w') as f:
        rapidjson.dump(meta, f)

    return

<div id = "data", style = "font-family: Helvetica Neue;; font-size:190%"> <span style = "color:#4285F4"><b>5 Ranking Algorithms</b></span>
</div>

<div id = "data", style = "font-family: Helvetica Neue;; font-size:160%"> <span style = "color:#4285F4"><b>5.1 TF-IDF</b></span>
</div>

In [12]:
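def search_tf_idf(tokens, top_k=10, data_size='large'):
    """
    Computes tf-idf scores
    """
    title_dict = defaultdict(float)
    if data_size == 'large':
        path = 'large_index/postings_tf'
    elif data_size == 'small':
        path = 'small_index/small_tf'
    else:
        raise ValueError(f"Unknown data_size: {data_size}")
    
    # The collection statistics are the same for every term, so load them once
    with open(f'{path}/tf_info.json') as f:
        tf_info = json.load(f)
    
    for term in tokens:
        try:
            with open(f'{path}/{term[0]}/{term}.json') as f:
                index = json.load(f)
        except (OSError, IndexError, ValueError):
            # Empty term or term without an index file: it contributes nothing
            continue
        
        # Inverse document frequency: total documents over documents containing the term
        idf = np.log(tf_info['total_documents']/len(index))
        
        for document in index:
            title_dict[document['pid']] += (1+np.log(document['tf'])) * idf
    
    titles = [(k, v) for k,v in title_dict.items()]

    return sorted(titles, key=lambda m: (-m[1],m[0]))[:top_k]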


<div class="alert alert-success", style = "font-family: Helvetica Neue; font-size:110%"> 
    As output of the <b>search_tf_idf()</b> function, a list of tuples is returned, each containing a passage id and the corresponding tf-idf score. This list is sorted on the tf-idf score, from highest to lowest. The following is an example of the output structure:
</div>

```
[(pid_123456, tf_idf),
 (pid...)]
```
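
For reference, a common log-scaled tf-idf weight for one term in one passage is (1 + ln tf) · ln(N / df), where N is the number of passages and df the number of passages that contain the term. The sketch below evaluates it for invented numbers; the function name `tf_idf_weight` is hypothetical.

```python
import math

def tf_idf_weight(tf: int, doc_freq: int, total_documents: int) -> float:
    # Log-scaled term frequency times inverse document frequency.
    # doc_freq is the number of passages that contain the term.
    return (1 + math.log(tf)) * math.log(total_documents / doc_freq)

# A term occurring 3 times in a passage, found in 10 of 1000 passages
rare = tf_idf_weight(3, 10, 1000)
# The same term frequency for a term found in 500 of 1000 passages
common = tf_idf_weight(3, 500, 1000)
print(rare > common)  # True
```

A term that occurs in every passage gets a weight of 0, because ln(N / N) = 0: it cannot distinguish one passage from another.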

<div id = "data", style = "font-family: Helvetica Neue;; font-size:160%"> <span style = "color:#4285F4"><b>5.2 Query Likelihood</b></span>
</div>

In [13]:
def query_likelihood(tokens, top_k=10, data_size='large'):
    """
    Document Turbo Fetching!
    Ranks passages by unsmoothed query likelihood: the sum of
    log P(token | passage) over all query tokens. Only passages that
    contain every query token receive a score.
    """
    if data_size == 'large':
        path = 'large_index/postings_lm'
    elif data_size == 'small':
        path = 'small_index/small_lm'
    else:
        raise ValueError(f"Unknown data_size: {data_size}")

    query_terms = {}
    query_frequency = defaultdict(dict)

    for token in tokens:
        query_terms[token] = set()
        try:
            with open(f'{path}/{token[0]}/{token}.json') as f:
                index = json.load(f)
        except (OSError, IndexError, ValueError):
            # Token not in the index: its set of passages stays empty
            continue

        # Read the postings inside the loop, so every token is processed
        for document in index['postings']:
            query_terms[token].add(document['pid'])
            query_frequency[token][document['pid']] = document['term_frequency']

    # Passages that contain every query token
    common_docs = set.intersection(*query_terms.values()) if query_terms else set()

    title_dict = defaultdict(float)

    for token in tokens:
        for document in common_docs:
            title_dict[document] += np.log(query_frequency[token][document])

    # Keep the scores as floats: truncating them to int loses the ranking order
    titles = [(k, float(v)) for k, v in title_dict.items()]

    return sorted(titles, key=lambda m: (-m[1], m[0]))[:top_k]


<div class="alert alert-success", style = "font-family: Helvetica Neue; font-size:110%"> 
    As output of the <b>query_likelihood()</b> function, a list of tuples is returned, each containing a passage id and the corresponding query likelihood score. This list is sorted on the query likelihood score, from highest to lowest. The following is an example of the output structure:
</div>

```
[(pid_123456, query likelihood score),
 (pid...)]
```
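
As a worked example of the score itself, the sketch below computes the unsmoothed log query likelihood for one invented passage, using a maximum-likelihood unigram model. The function name `log_query_likelihood` is hypothetical and works on token lists instead of the on-disk index.

```python
import math

def log_query_likelihood(query_tokens, passage_tokens):
    # Sum of log P(token | passage) under a maximum-likelihood unigram model.
    # Returns -inf as soon as one query token is missing from the passage.
    score = 0.0
    for token in query_tokens:
        p = passage_tokens.count(token) / len(passage_tokens)
        if p == 0:
            return float("-inf")
        score += math.log(p)
    return score

toy_passage = ["book", "brows", "book", "file"]
print(round(log_query_likelihood(["book", "file"], toy_passage), 4))  # ln(0.5) + ln(0.25)
print(log_query_likelihood(["book", "pen"], toy_passage))             # -inf
```

The second call shows why smoothing is useful: a single missing query token rules a passage out completely.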

<div id = "data", style = "font-family: Helvetica Neue;; font-size:160%"> <span style = "color:#4285F4"><b>5.3 Smooth Query Likelihood</b></span>
</div>

In [14]:
def smooth_query_likelihood(tokens, top_k=10, alpha = 0.1, data_size='large'):
    """
    Ranks passages by query likelihood with linear smoothing: the sum over
    all query tokens of
    log(alpha * P(token | corpus) + (1 - alpha) * P(token | passage)).
    """
    if data_size == 'large':
        path = 'large_index/postings_lm'
    elif data_size == 'small':
        path = 'small_index/small_lm'
    else:
        raise ValueError(f"Unknown data_size: {data_size}")

    query_frequency = defaultdict(dict)
    corpus_frequency = {}

    for token in tokens:
        try:
            with open(f'{path}/{token[0]}/{token}.json') as f:
                index = json.load(f)
        except (OSError, IndexError, ValueError):
            # Token not in the index: it cannot distinguish passages, so skip it
            continue

        # Keep the corpus frequency per token instead of only the last one loaded
        corpus_frequency[token] = index['corpus_frequency']
        for document in index['postings']:
            query_frequency[token][document['pid']] = document['term_frequency']

    # Candidates: passages that contain at least one query token
    either_docs = set().union(*query_frequency.values())

    title_dict = defaultdict(float)

    for token in tokens:
        if token not in corpus_frequency:
            continue
        for document in either_docs:
            # A passage without the token has a passage probability of 0
            tf = query_frequency[token].get(document, 0.0)
            title_dict[document] += np.log((alpha * corpus_frequency[token]) + ((1-alpha) * tf))

    titles = [(k, round(float(v), 4)) for k, v in title_dict.items()]

    return sorted(titles, key=lambda m: (-m[1], m[0]))[:top_k]


<div class="alert alert-success", style = "font-family: Helvetica Neue; font-size:110%"> 
    As output of the <b>smooth_query_likelihood()</b> function, a list of tuples is returned, each containing a passage id and the corresponding smooth query likelihood score. This list is sorted on the smooth query likelihood score, from highest to lowest. The following is an example of the output structure:
</div>

```
[(pid_123456, smooth query likelihood score),
 (pid...)]
```
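
The effect of smoothing can be sketched with linear interpolation between the passage model and the corpus model, where `alpha` weights the corpus model. The function name `smoothed_log_likelihood` and the corpus probabilities below are invented for illustration.

```python
import math

def smoothed_log_likelihood(query_tokens, passage_tokens, corpus_prob, alpha=0.1):
    # Linear interpolation: alpha weights the corpus model,
    # (1 - alpha) weights the passage model.
    score = 0.0
    for token in query_tokens:
        p_doc = passage_tokens.count(token) / len(passage_tokens)
        score += math.log(alpha * corpus_prob[token] + (1 - alpha) * p_doc)
    return score

toy_corpus_prob = {"book": 0.02, "pen": 0.01}  # invented corpus probabilities
toy_tokens = ["book", "brows", "book", "file"]
smoothed = smoothed_log_likelihood(["book", "pen"], toy_tokens, toy_corpus_prob)
print(math.isfinite(smoothed))  # True
```

Although "pen" does not occur in the passage, the score stays finite, because the corpus model contributes a small non-zero probability.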

<div id = "data", style = "font-family: Helvetica Neue;; font-size:160%"> <span style = "color:#4285F4"><b>5.4 BM25</b></span>
</div>

In [15]:
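def bm25(tokens, k_1=1.05 , k_3=0, b=0.85 , top_k=10, data_size='large'):
    """
    Computes bm25 scores
    """
    title_dict = defaultdict(float)
    if data_size == 'large':
        path = 'large_index/postings_tf'
    elif data_size == 'small':
        path = 'small_index/small_tf'
    else:
        raise ValueError(f"Unknown data_size: {data_size}")
    
    # The collection statistics are the same for every term, so load them once
    meta_data = postings_loader(f'{path}/tf_info.json')
    
    # Count how often every term occurs in the query
    for term, query_tf in Counter(tokens).items():
        
        try:
            postings = postings_loader(f'{path}/{term[0]}/{term}.json')
        except (OSError, IndexError, ValueError):
            # Empty term or term without an index file: it contributes nothing
            continue
        
        doc_freq = len(postings)
        idf = np.log(meta_data['total_documents']/doc_freq)
        # Query part of the formula, based on the query term frequency; equals 1 when k_3 = 0
        query_part = ((k_3 + 1) * query_tf)/(k_3 + query_tf)
        
        for document in postings:
            # Length normalisation relative to the average document length
            norm = k_1 * ((1-b) + (b * (document['length_document']/meta_data['average_doc_length'])))
            doc_part = ((k_1 + 1) * document['tf'])/(norm + document['tf'])
            title_dict[document['pid']] += doc_part * idf * query_part
    
    titles = [(k, v) for k,v in title_dict.items()]

    return sorted(titles, key=lambda m: (-m[1],m[0]))[:top_k]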


<div class="alert alert-success", style = "font-family: Helvetica Neue; font-size:110%"> 
    As output of the <b>bm25()</b> function, a list of tuples is returned, each containing a passage id and the corresponding bm25 score. This list is sorted on the bm25 score, from highest to lowest. The following is an example of the output structure:
</div>

```
[(pid_123456, bm25 score),
 (pid...)]
```
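
To show how the length normalisation in BM25 behaves, the sketch below scores one term for a short and a long passage with the same term frequency. It covers only the document part and the idf, leaving out the query-term factor; the function name `bm25_term_score` and all numbers are assumptions for illustration.

```python
import math

def bm25_term_score(tf, doc_length, avg_doc_length, doc_freq, total_documents,
                    k_1=1.05, b=0.85):
    # Inverse document frequency: rarer terms weigh more
    idf = math.log(total_documents / doc_freq)
    # Length normalisation: longer-than-average passages are penalised
    norm = k_1 * ((1 - b) + b * doc_length / avg_doc_length)
    return idf * ((k_1 + 1) * tf) / (norm + tf)

short_score = bm25_term_score(tf=2, doc_length=50, avg_doc_length=100,
                              doc_freq=10, total_documents=1000)
long_score = bm25_term_score(tf=2, doc_length=200, avg_doc_length=100,
                             doc_freq=10, total_documents=1000)
print(short_score > long_score)  # True
```

With `b = 0` the passage length has no influence at all, while `b = 1` normalises fully by the relative passage length.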

<div style = "font-family: Helvetica Neue;; font-size:190%"> 
<span style = "color:#DB4437"><b>6 Evaluation </b></span>
</div>

<div style = "font-family: Helvetica Neue;; font-size:160%"> 
<span style = "color:#DB4437"><b>6.1 Functions</b></span>
</div>

<div class="alert alert-warning", style = "font-family: Helvetica Neue; font-size:110%"> 
In the cell below, you'll find the functions corresponding to the evaluation metrics we've used to evaluate the results of our search engine: <br>
    
<ul>
    <li>The dcg score</li>
    <li>The ndcg score</li>
    <li>The precision score</li>
    <li>The average precision</li>
</ul>
</div>

In [16]:
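def dcg(scores, k=None):
    # Discounted cumulative gain over the first k scores (all scores if k is None)
    scores = np.asarray(scores, dtype=float)[:k]
    discounts = np.log2(np.arange(len(scores)) + 2)
    return float(np.sum(scores / discounts))

def ndcg(scores, k):
    # Normalise by the DCG of the same scores sorted from high to low
    scores = np.asarray(scores, dtype=float)
    ideal = dcg(-np.sort(-scores), k)
    # Without any relevant result the ideal DCG is 0, so return 0
    if ideal == 0:
        return 0.0
    return dcg(scores, k) / ideal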
    
def precision(scores, k):

    if scores[:k].sum():
        return sum(scores[:k])/len(scores[:k])
    else:
        return 0.0
    
def average_precision(scores):
    if scores.sum():
        return np.array([precision(scores, i+1) for i in np.nonzero(scores)[0]]).sum()/abs(scores.sum())
    else:
        return 0.0    
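
As a worked example of DCG and nDCG, the sketch below scores one invented ranking with graded relevance labels. The helper names `dcg_at_k` and `ndcg_at_k` are hypothetical; here the ideal ranking is assumed to be the same labels sorted from high to low.

```python
import math

def dcg_at_k(relevances, k):
    # Discount the relevance at 0-based position i by log2(i + 2)
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalise by the DCG of the same labels in the best possible order
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

toy_ranking = [3, 0, 2, 1]  # invented relevance labels, in ranked order
print(round(dcg_at_k(toy_ranking, 4), 4))
print(round(ndcg_at_k(toy_ranking, 4), 4))
```

A ranking that is already sorted by relevance gets an nDCG of 1, and a ranking without any relevant passage gets 0.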


<div class="alert alert-warning", style = "font-family: Helvetica Neue; font-size:110%"> 
Subsequently, we have a cell for every evaluation metric. Each cell contains functions that evaluate every ranker that we've implemented (tf-idf, query likelihood, smoothed query likelihood and BM25, respectively) on that measure.
</div>

In [17]:
def p20_map_tfidf(qid, labels, tokenised_queries):
    
    output = []
    
    pids = [i[0] for i in search_tf_idf(tokenised_queries[qid], len(labels[qid]))]
    validation = list(labels[qid].keys())
    for i in pids:
        if i in validation:
            output.append(1)
        else: output.append(0)
    
    return np.array(output)

def p20_map_QL(qid, labels, tokenised_queries):
    
    output = []
    
    pids = [i[0] for i in query_likelihood(tokenised_queries[qid], len(labels[qid]))]
    validation = list(labels[qid].keys())
    for i in pids:
        if i in validation:
            output.append(1)
        else: output.append(0)
    
    return np.array(output)

def p20_map_SQL(qid, labels, tokenised_queries, alpha):
    
    output = []
    
    pids = [i[0] for i in smooth_query_likelihood(tokenised_queries[qid], len(labels[qid]), alpha)]
    validation = list(labels[qid].keys())
    for i in pids:
        if i in validation:
            output.append(1)
        else: output.append(0)
    
    return np.array(output)

def p20_map_BM25(qid, labels, tokenised_queries, k_1, k_3, b):
    
    output = []
    
    pids = [i[0] for i in bm25(tokenised_queries[qid], k_1, k_3, b ,len(labels[qid]))]
    validation = list(labels[qid].keys())
    for i in pids:
        if i in validation:
            output.append(1)
        else: output.append(0)
    
    return np.array(output)

<div style = "font-family: Helvetica Neue;; font-size:160%"> 
<span style = "color:#DB4437"><b>6.1.1 Precision@20</b></span>
</div>

<div class="alert alert-warning", style = "font-family: Helvetica Neue; font-size:110%"> 
Mean precision@20 over all queries of the validation set, for each of the four rankers. 
</div>

In [18]:
def precision20_tfidf(tokenised_queries, labels, k):
    
    precisions = []
    for qid in tqdm.notebook.tqdm(tokenised_queries.keys()):
        precisions.append(precision(p20_map_tfidf(qid, labels, tokenised_queries),k))
        
    return np.mean(precisions)       

def precision20_QL(tokenised_queries, labels, k):
    
    precisions = []
    for qid in tqdm.notebook.tqdm(tokenised_queries.keys()):
        precisions.append(precision(p20_map_QL(qid, labels, tokenised_queries),k))
        
    return np.mean(precisions)  

def precision20_SQL(tokenised_queries, labels, k, alpha):
    
    precisions = []
    for qid in tqdm.notebook.tqdm(tokenised_queries.keys()):
        precisions.append(precision(p20_map_SQL(qid, labels, tokenised_queries, alpha),k))
        
    return np.mean(precisions)    

def precision20_BM25(tokenised_queries, labels, k, k_1, k_3, b):
    
    precisions = []
    for qid in tqdm.notebook.tqdm(tokenised_queries.keys()):
        precisions.append(precision(p20_map_BM25(qid, labels, tokenised_queries, k_1, k_3, b),k))
        
    return np.mean(precisions)    

<div style = "font-family: Helvetica Neue;; font-size:160%"> 
<span style = "color:#DB4437"><b>6.1.2 MAP</b></span>
</div>

<div class="alert alert-warning", style = "font-family: Helvetica Neue; font-size:110%"> 
Mean average precision (MAP) over all queries of the validation set, for each of the four rankers. 
</div>
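
Average precision for a single query can be illustrated on a small hit vector, where 1 marks a relevant passage at that rank. The function name `average_precision_sketch` is hypothetical; like our own implementation, it averages over the relevant passages that were actually retrieved.

```python
def average_precision_sketch(hits):
    # hits: 1 if the passage at that rank is relevant, else 0
    relevant_seen = 0
    precisions = []
    for rank, hit in enumerate(hits, start=1):
        if hit:
            relevant_seen += 1
            # Precision at the rank of every relevant passage
            precisions.append(relevant_seen / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

print(round(average_precision_sketch([1, 0, 1, 0]), 4))  # (1/1 + 2/3) / 2 = 0.8333
```

MAP is then the mean of this value over all queries.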


In [19]:
def MAP_tfidf(tokenised_queries, labels):
    
    MAP = []
    for qid in tqdm.notebook.tqdm(tokenised_queries.keys()):
        MAP.append(average_precision(p20_map_tfidf(qid, labels, tokenised_queries)))
        
    return np.mean(MAP)

def MAP_QL(tokenised_queries, labels):
    
    MAP = []
    for qid in tqdm.notebook.tqdm(tokenised_queries.keys()):
        MAP.append(average_precision(p20_map_QL(qid, labels, tokenised_queries)))
        
    return np.mean(MAP)

def MAP_SQL(tokenised_queries, labels, alpha):
    
    MAP = []
    for qid in tqdm.notebook.tqdm(tokenised_queries.keys()):
        
        MAP.append(average_precision(p20_map_SQL(qid, labels, tokenised_queries, alpha)))
        
    return np.mean(MAP)

def MAP_BM25(tokenised_queries, labels, k_1, k_3, b):
    
    MAP = []
    for qid in tqdm.notebook.tqdm(tokenised_queries.keys()):
        MAP.append(average_precision(p20_map_BM25(qid, labels, tokenised_queries, k_1, k_3, b)))
        
    return np.mean(MAP)

<div style = "font-family: Helvetica Neue;; font-size:160%"> 
<span style = "color:#DB4437"><b>6.1.3 nDCG</b></span>
</div>

<div class="alert alert-warning", style = "font-family: Helvetica Neue; font-size:110%"> 
Average nDCG of the top 20 predicted relevance values for all queries.
</div>


In [20]:
def ndcg20_tfidf(tokenised_queries, labels, k):
    
    ndcg_list = []
    
    for qid in tqdm.notebook.tqdm(tokenised_queries.keys()):
        
        # Start with empty lists for every query, so values of earlier queries don't leak in
        real_vals = []
        ideal_vals = []
        
        # Which passages are in the top k of our prediction?
        prediction = search_tf_idf(tokenised_queries[qid], k)
        
        # What are the true labels for this query?
        real = labels[qid]
        
        for p in prediction:
            pid = p[0]

            # What is the true relevance of this prediction for this query?
            if pid in real.keys():
                rr = real[pid]
                real_vals.append(rr)
                
            # If the pid is not in the true labels, fill in 0
            else:
                real_vals.append(0)
                
            # The ideal value is always 3, since that is the maximum relevance
            ideal_vals.append(3)
            
        # A query without any prediction gets an nDCG of 0
        ndcg = dcg(real_vals) / dcg(ideal_vals) if prediction else 0.0
        ndcg_list.append(ndcg)
        
        
    return np.mean(ndcg_list)

def ndcg20_QL(tokenised_queries, labels, k):
    
    ndcg_list = []
    
    for qid in tqdm.notebook.tqdm(tokenised_queries.keys()):
        
        # Start with empty lists for every query, so values of earlier queries don't leak in
        real_vals = []
        ideal_vals = []
        
        # Which passages are in the top k of our prediction?
        prediction = query_likelihood(tokenised_queries[qid], k)
        
        # What are the true labels for this query?
        real = labels[qid]
        
        for p in prediction:
            pid = p[0]

            # What is the true relevance of this prediction for this query?
            if pid in real.keys():
                rr = real[pid]
                real_vals.append(rr)
                
            # If the pid is not in the true labels, fill in 0
            else:
                real_vals.append(0)
                
            # The ideal value is always 3, since that is the maximum relevance
            ideal_vals.append(3)
            
        # A query without any prediction gets an nDCG of 0
        ndcg = dcg(real_vals) / dcg(ideal_vals) if prediction else 0.0
        ndcg_list.append(ndcg)
        
        
    return np.mean(ndcg_list)

def ndcg20_SQL(tokenised_queries, labels, k, alpha):
    
    ndcg_list = []
    
    for qid in tqdm.notebook.tqdm(tokenised_queries.keys()):
        
        # reset per query, so that each nDCG only covers this query's ranking
        real_vals = []
        ideal_vals = []
        
        # which passages are in the top k of our prediction?
        prediction = smooth_query_likelihood(tokenised_queries[qid], k, alpha)
        
        # what are the true relevance labels for this query?
        real = labels[qid]
        
        for p in prediction:
            pid = p[0]

            # true relevance of this predicted passage; 0 if the pid is not labelled
            real_vals.append(real.get(pid, 0))
                
            # the ideal value is always 3, because that is the maximum relevance
            ideal_vals.append(3)
            
        # a query without any predictions scores 0 (and avoids dividing by zero)
        ndcg = dcg(real_vals) / dcg(ideal_vals) if ideal_vals else 0.0
        ndcg_list.append(ndcg)
        
    return np.mean(ndcg_list)

def ndcg20_BM25(tokenised_queries, labels, k, k_1=1.2 , k_3=0, b=0.68):
    
    ndcg_list = []
    
    for qid in tqdm.notebook.tqdm(tokenised_queries.keys()):
        
        # reset per query, so that each nDCG only covers this query's ranking
        real_vals = []
        ideal_vals = []
        
        # what are the true relevance labels for this query?
        real = labels[qid]
        
        # which passages are in the top k of our prediction?
        prediction = bm25(tokenised_queries[qid], k_1, k_3, b, k)
        
        for p in prediction:
            pid = p[0]

            # true relevance of this predicted passage; 0 if the pid is not labelled
            real_vals.append(real.get(pid, 0))
                
            # the ideal value is always 3, because that is the maximum relevance
            ideal_vals.append(3)
        
        # a query without any predictions scores 0 (and avoids dividing by zero)
        ndcg = dcg(real_vals) / dcg(ideal_vals) if ideal_vals else 0.0
        ndcg_list.append(ndcg)
        
    return np.mean(ndcg_list)



<div style = "font-family: Helvetica Neue;; font-size:160%"> 
<span style = "color:#DB4437"><b>6.2 Results</b></span>
</div>

<div style = "font-family: Helvetica Neue; font-size:110%"> 
After evaluating the different ranking models with the evaluation metrics above, we got some interesting results. The evaluation scores for our Query Likelihood and Smooth Query Likelihood algorithms are too low to even consider using them for our search engine. Therefore, we have focused on comparing the TF-IDF algorithm with the BM25 algorithm. This resulted in the following evaluation scores: <br><br>

<div style = "font-family: Helvetica Neue; font-size:120%"> 
<span style = "color:#DB4437"><b>Evaluation of TF-IDF</b></span><br>
<img src="eval_tfidf.jpeg" alt="TF-IDF evaluation" style="width: 500px; float: left;"/>
</div>
</div>


<br>

<div style = "font-family: Helvetica Neue; font-size:120%"> 
<span style = "color:#DB4437"><b>Evaluation of BM25</b></span><br>
<img src="eval_bm25.jpeg" alt="BM25 evaluation" style="width: 500px; float: left;"/>
</div>
</div>

<div class="alert alert-info", style = "font-family: Helvetica Neue; font-size:110%"> 
<b>In conclusion:</b> the BM25 algorithm has the best overall scores, which is why we've chosen BM25 as the ranking algorithm for our search engine. 
</div>

<div style = "font-family: Helvetica Neue;; font-size:160%"> 
<span style = "color:#DB4437"><b>6.3 Questions and Answers</b></span>
</div>

<div style = "font-family: Helvetica Neue; font-size:110%"> 
In the milestones, several questions were asked regarding the performance of our functions. We failed to explicitly answer <i>all</i> questions, since it took us almost 3 weeks to create the index in a way that didn't crash our entire notebook and working memory. Because of this, our focus wasn't on subtle differences between the provided functions and our own functions. Since we've extensively adjusted and rewritten the code since then, it isn't possible to check the performance of our first few drafts of the code. The questions that we were able to answer are listed below.

<div class="alert alert-info", style = "font-family: Helvetica Neue; font-size:110%"> 
<dl>
<dt>How much time do you gain by ranking documents using an inverted index, compared to the original code? Report the time it takes to run the model before and after making the index.</dt>
<dd>According to one of the teaching assistants, the model takes about 100 hours to run without an inverted index. After making the index, it takes about 20 minutes. Clearly, this is a huge difference.</dd><br>
    
<dt>Do you maintain the same value for MRR on the validation set? Report MRR on the validation set before and after making the index.</dt>
<dd>Both before and after making the index, the MRR score is 0.245.</dd><br>
    
<dt>What time does it take to build the index?</dt>
<dd>Building the index takes around 8 minutes.</dd><br>
    
<dt>What time does it take to retrieve the documents?</dt>
<dd>Practically no time, short enough to be disregarded.</dd><br>
    
<dt>How does adding more advanced text processing techniques help your model's performance?</dt>
<dd>When comparing different pre-processing techniques, we see that ASCII scores better than both the most "basic" regex pattern we created and a regex pattern that tolerates slightly more characters than our standard one. This is a great example of how the pre-processing technique influences the model's performance. When you are too "strict" in filtering out characters, you lose important information, which negatively impacts the overall performance of the algorithm.</dd>
<img src="preprocessing.jpg" alt="Comparison of pre-processing techniques" style="width: 700px; float: left;"/><br>

<br><br><br><br>
<dt>How does stopword removal positively impact your performance?</dt>
<dd>Removing stopwords makes sure that documents aren't falsely ranked higher because of the fact that both the query and the document contain the same stopwords (which are often used in the English language).</dd><br>

<dt>Query Likelihood versus Smooth Query Likelihood</dt>
<dd>Our (smooth) query likelihood model doesn't score nearly as well as TF-IDF or BM25. However, if we had to choose one or the other, it would be smooth query likelihood. Smoothing extends query likelihood by mixing in collection-wide term statistics, so a document no longer gets a zero score just because it lacks one query term. This makes the model slightly more complex and the result more accurate.</dd>

    
</dl>
    
</div>
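To make the last answer concrete, here is a minimal sketch of query likelihood with and without smoothing. It assumes Jelinek-Mercer interpolation with weight `alpha` on the collection model; the function name, the toy data and the exact form of the smoothing are illustrative assumptions, not the notebook's `query_likelihood()` or `smooth_query_likelihood()` implementations.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, alpha=0.0):
    """Log P(query | doc) with Jelinek-Mercer interpolation.

    alpha = 0 gives the unsmoothed model; alpha > 0 mixes in collection statistics.
    Returns -inf as soon as a query term has zero probability.
    """
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc)
        p_coll = coll_tf[term] / len(collection)
        p = (1 - alpha) * p_doc + alpha * p_coll
        if p == 0:
            return float("-inf")
        score += math.log(p)
    return score

doc = ["cheap", "flights", "to", "paris"]
collection = doc + ["hotel", "deals", "in", "paris"]
query = ["paris", "hotel"]

# "hotel" does not occur in doc, so the unsmoothed score collapses
unsmoothed = query_log_likelihood(query, doc, collection)
# with smoothing the missing term is penalised, but the score stays finite
smoothed = query_log_likelihood(query, doc, collection, alpha=0.1)
```

Without smoothing, every document that misses a single query term ends up with the same worst possible score, which makes those documents impossible to rank against each other.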

<div style = "font-family: Helvetica Neue;; font-size:190%"> 
<span style = "color:#F4B400"><b>7. Creating TREC file</b></span>
</div>

<div class="alert alert-warning", style = "font-family: Helvetica Neue; font-size:110%"> 
This function creates TREC files that are suitable for submitting the ranking to CodaLab. 
</div>

In [21]:
def trec_submision(queries, k_1=1.1, k_3=0, b=0.85, top_k=100, data_size='small', type_='validation'):

    trec_lines = []
    
    for query_id in tqdm.notebook.tqdm(queries):
        tokens = queries[query_id]
        ranking = bm25(tokens, k_1, k_3, b, top_k, data_size)
            
        # one line per retrieved passage: query id, passage id, 1-based rank
        for rank, result in enumerate(ranking, start=1):
            pid = result[0]
            trec_lines.append(f"{query_id} {pid} {rank}\n")

    # the with-block closes the file, also when writing fails
    with open(f"TREC/submission_{type_}.text", 'w') as text_file:
        text_file.writelines(trec_lines)


<div style = "font-family: Helvetica Neue;; font-size:190%"> 
<span style = "color:#0F9D58"><b>8. Feature Construction</b></span>
</div>



<div class="alert alert-info", style = "font-family: Helvetica Neue; font-size:110%"> 
In this section the feature vectors that are used for the reranking phase will be constructed. 
    
The feature vectors all contain the following features:
<ul>
    <li>tf-idf score (tfidf)</li>
    <li>query likelihood score (QL)</li>
    <li>bm25 score (bm25)</li>
    <li>query term count (QTC)</li>
    <li>document term count (DTC)</li>
    <li>average word embedding score (AWE)</li>
</ul>

    
The following ranker computes the bm25 score and ranks documents by it, but the function returns the tf-idf score and DTC as well. Since our inverted index has been constructed on disk, it takes some time to load the documents into memory. The function has been written to expedite the process, so that the same document data does not have to be loaded into memory again for every single feature score.
    
The function <i>get_features()</i> builds a dataframe used for the feature vector. Pickle files are stored at certain intervals to make sure the vector can be constructed incrementally.
 
</div>
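The incremental construction can be pictured with the following sketch. It uses plain `pickle` on a list of dictionaries instead of a pandas dataframe, and the function name, the `every` parameter and the stand-in features are assumptions made for illustration only.

```python
import os
import pickle
import tempfile

def build_rows_with_checkpoints(items, checkpoint_path, every=2):
    """Collect one feature row per item, saving all rows so far every `every` items."""
    rows = []
    for count, item in enumerate(items, start=1):
        # stand-in for the real feature computation
        rows.append({"item": item, "length": len(item)})
        if count % every == 0:
            with open(checkpoint_path, "wb") as handle:
                pickle.dump(rows, handle)
    return rows

checkpoint_path = os.path.join(tempfile.mkdtemp(), "rows_checkpoint.pkl")
rows = build_rows_with_checkpoints(["a", "bb", "ccc"], checkpoint_path, every=2)

# the checkpoint holds the rows that existed at the last multiple of `every`
with open(checkpoint_path, "rb") as handle:
    saved_rows = pickle.load(handle)
```

If the run is interrupted, the rows in the last checkpoint are still available, so only the work done after that checkpoint is lost.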

In [22]:
def bm25_plus_tfidf(tokens, k_1=1.05 , k_3=0, b=0.85 , top_k=10, data_size='large'):
    """
    Ranks documents by their bm25 score.
    Returns the top_k as (pid, (bm25, (tfidf, document length))) tuples.
    """
    
    if data_size == 'large':
        path = 'large_index/postings_tf'
    elif data_size == 'small':
        path = 'small_index/small_tf'
    
    # collection statistics are the same for every term, so load them once
    meta_data = postings_loader(f'{path}/tf_info.json')
    
    # pid -> [summed bm25, summed tfidf, document length]
    scores = defaultdict(lambda: [0.0, 0.0, 0])
    
    for term in tokens:
        
        try:
            postings = postings_loader(f'{path}/{term[0]}/{term}.json')
            doc_freq = len(postings)
        
            for document in postings:
                tf = document['tf']
                doc_length = document['length_document']
                
                length_norm = (1 - b) + b * (doc_length / meta_data['average_doc_length'])
                bm25 = ((k_1 + 1) * tf) / (k_1 * length_norm + tf) * \
                    np.log(meta_data['total_documents'] / doc_freq) * \
                    ((k_3 + 1) * tf) / (k_3 + tf)

                # NOTE: as in the original code, this idf-like factor divides by the document length
                tfidf = (1 + np.log(tf)) * np.log(meta_data['total_documents'] / doc_length)

                # sum the per-term scores of a document instead of concatenating tuples
                scores[document['pid']][0] += bm25
                scores[document['pid']][1] += tfidf
                scores[document['pid']][2] = doc_length
                
        except Exception:
            # e.g. the term is not in the index: it contributes nothing
            continue

    titles = [(pid, (s[0], (s[1], s[2]))) for pid, s in scores.items()]
 
    return sorted(titles, key=lambda m: (-m[1][0], m[0]))[:top_k]


In [23]:
def ranker(tokenised_queries):
    
    # maps each (bm25, (tfidf, DTC)) score tuple to the (qid, pid) pairs that produced it
    rank = defaultdict(list)
    for qid in tqdm.notebook.tqdm(tokenised_queries.keys()):
        
        bm = bm25_plus_tfidf(tokenised_queries[qid], top_k=100, data_size='large')
        for x in bm:
            
            rank[x[1]].append((qid, x[0]))
    
    return rank

In [24]:
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter
import pandas as pd 
import pickle
import time



def get_features(tokenised_queries, labels, data_size='large'):
    start_time = time.time()
    if data_size == 'large':
        path_tf = 'large_index/postings_tf'
        path_lm = 'large_index/postings_lm'
    elif data_size == 'small':
        path_tf = 'small_index/small_tf'
        path_lm = 'small_index/small_lm'
        
    
    rows = []
    alpha=0.1
    
    r = ranker(tokenised_queries)

    count = 0
    for y,x in tqdm.notebook.tqdm(r.items()):
        qid = x[0][0]
        pid = x[0][1]
        count+=1
         
        
        # a QL score for one qid and one pid
        QL = 0
        tfidf = y[1][0]
        DTC = y[1][1]
        
        for token in tokenised_queries[qid]:
            
            try:
                lm_index = json.load(open(f'{path_lm}/{token[0]}/{token}.json'))

                for document in [x for x in lm_index['postings'] if x["pid"] == pid]:

                    tf = document['term_frequency']
                    cf = lm_index['corpus_frequency']

                    meta = json.load(open(f'{path_tf}/tf_info.json'))
                    u = meta['average_doc_length']

                    try:
                        QL += np.log((tf + (u*cf)/ (DTC+u)))
                    except:
                        QL += np.log( (alpha * cf) + ((1-alpha) * tf))
            except:
                pass
        
        
        if pid in labels[qid]:
            relevance = labels[qid][pid]
        else:
            relevance = 0
    
        QTC = len(tokenised_queries[qid])


        rows.append({
            "qid": qid,
            "pid": pid,
            "tfidf": tfidf,
            "relevance" : relevance,
            "QL": QL,
            "bm25": y[0],
            "QTC": QTC,
            "DTC": DTC,
        })
        
        if count % 100_000 == 0:
            temp_df = pd.DataFrame(rows)
            temp_df.to_pickle(f'pickles/training_df.pkl')
        
    print(f'finished in: {time.time()-start_time} seconds')
    return pd.DataFrame(rows)

In [25]:
def get_features_test(tokenised_queries, tokenised_passages,  vectorizer, data_size='small'):
    if data_size == 'large':
        path_tf = 'large_index/postings_tf'
        path_lm = 'large_index/postings_lm'
    elif data_size == 'small':
        path_tf = 'small_index/small_tf'
        path_lm = 'small_index/small_lm'
    
    rows =[]
    alpha=0.01
    
    r = ranker(tokenised_queries)
    for y,x in tqdm.notebook.tqdm(r.items()):
        qid = x[0][0]
        pid = x[0][1]
        
        # query, title & body vectors
        query_vector = vectorizer.transform([" ".join(tokenised_queries[qid])])
        passage_vector = vectorizer.transform([""])
        
        if pid in tokenised_passages.keys():
            passage_vector = vectorizer.transform([" ".join(tokenised_passages[pid])])
        
        # DENSE query, title & body vectors
        query_vector_dense = query_vector.todense() 
        passage_vector_dense = passage_vector.todense()
        
        
        tfidf = cosine_similarity(query_vector, passage_vector)[0][0]
         
        # a QL score for one qid and one pid
        QL = 0
        tfidf_sim = 0 
        DTC = 0
        
        for token in tokenised_queries[qid]:
            try:
                index = json.load(open(f'{path_tf}/{token[0]}/{token}.json'))
                tf_info = json.load(open(f'{path_tf}/tf_info.json'))
                
                for document in index:
                    if document['pid'] == pid: 
                        tfidf_sim +=  (1+np.log(document['tf']))*\
                        (np.log(tf_info['total_documents']/document['length_document']))
            except:
                tfidf_sim+= 0
                
            try:
                index = json.load(open(f'{path_lm}/{token[0]}/{token}.json'))
                tf = index['postings'][0]['term_frequency']
                cf = index['corpus_frequency']

                meta = json.load(open(f'{path_tf}/tf_info.json'))
                u = meta['average_doc_length']

                DTC = len(tokenised_passages[pid])

                QL += np.log((tf + (u*cf)/ (DTC+u)))

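            except Exception:
                # no language-model postings for this token (or the passage is unknown):
                # skip it, as get_features() does, instead of reusing undefined or stale tf/cf values
                pass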
        
    
        QTC = len(tokenised_queries[qid])
        
        
        rows.append({
            "qid": qid,
            "pid": pid,
            "tfidf": tfidf_sim,
            "tfidf similarity": tfidf,
            "QL": QL,
            # y is the (bm25, (tfidf, DTC)) tuple from ranker(); keep only the bm25 score
            "bm25": y[0],
            "QTC": QTC,
            "DTC": DTC,
        })
    
    return pd.DataFrame(rows)

<div class="alert alert-info">

<h3>Instead of constructing the entire feature vectors, we load in our pre-constructed feature vectors.</h3>
    
    


    
</div>

<div id = "data", style = "font-family: Helvetica Neue;; font-size:190%"> <span style = "color:#4285F4"><b>9 AWE similarity</b></span>
</div>

In [27]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

import gensim.downloader as api
from gensim.models.word2vec import Word2Vec
import gensim

<div class="alert alert-info">

<h2><u>First let's define</u></h2>

**Word embeddings** <br>
*The numerical representation of a text.*

**Pretrained Word Embeddings** <br>
Pretrained Word Embeddings are the embeddings learned in one task that are used for solving another similar task.

Word2Vec comes in two variants:

1. Continuous Bag-of-Words (CBOW)
2. Skip-gram model

**Continuous Bag-of-Words (CBOW) model** <br>
Learns the focus word given the neighboring words.

**Skip-gram model** <br>
Learns the neighboring words given the focus word. 

*Continuous Bag-of-Words and Skip-gram are inverses of each other.*
    
</div>
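The difference between the two variants is easiest to see in the training examples they are built from. The sketch below only generates those examples for a toy sentence; it does not train a model, and the function names and the window size of 1 are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=1):
    """(focus, neighbour) pairs: the focus word is the input, each neighbour a target."""
    pairs = []
    for i, focus in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((focus, tokens[j]))
    return pairs

def cbow_examples(tokens, window=1):
    """(neighbours, focus) examples: the neighbours are the input, the focus word the target."""
    examples = []
    for i, focus in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, focus))
    return examples

sentence = ["the", "cat", "sat"]
sg_pairs = skipgram_pairs(sentence)
cbow = cbow_examples(sentence)
```

For the middle word, skip-gram yields `("cat", "the")` and `("cat", "sat")`, while CBOW yields the single example `(["the", "sat"], "cat")`. In gensim's `Word2Vec`, the `sg` parameter selects the variant (`sg=0` for CBOW, `sg=1` for skip-gram).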

<div class="alert alert-warning">

The <b>AWE()</b> function computes the average word embedding (AWE) similarity for every <b> query + doc</b> combination in <i> data</i>. We use 50 dimensions for the embeddings and only use tokens that exist in the vocabulary of the trained model. The corpus that the model is trained on is created with the snippet of code below.


</div>
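The idea behind the score can be sketched without gensim: average the vectors of the known tokens on each side, then take the cosine similarity of the two averages. The toy two-dimensional vectors and all function names below are hypothetical; the notebook itself uses a trained Word2Vec model and scikit-learn's `cosine_similarity`.

```python
import math

def average_embedding(tokens, vectors):
    """Mean vector of all tokens that are in the vocabulary, or None if there are none."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dims)]

def cosine(u, v):
    """Cosine similarity of two equally long vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def awe_similarity(query, passage, vectors):
    """Cosine similarity of the averaged embeddings; 0.0 if one side has no known token."""
    q = average_embedding(query, vectors)
    p = average_embedding(passage, vectors)
    return 0.0 if q is None or p is None else cosine(q, p)

toy_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "pet": [1.0, 1.0]}
# "unknown" is ignored; the average of "cat" and "dog" points in the same direction as "pet"
sim = awe_similarity(["cat", "dog", "unknown"], ["pet"], toy_vectors)
```

Out-of-vocabulary tokens are simply skipped, and a pair in which one side has no known token at all gets similarity 0.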

In [28]:
from gensim import utils
from gensim.test.utils import datapath

# build the corpus that the model is trained on
class MyCorpus:
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        for line in open(corpus_path):
            # assume there's one document per line, tokens separated by whitespace
            yield utils.simple_preprocess(line)
            
# sentences = MyCorpus()
# model = Word2Vec(sentences = sentences, vector_size = 50)



In [29]:
# %memit tokenised_passages = passage_loader("data/large_tokenised_passages.json")

In [30]:
from sklearn.metrics.pairwise import cosine_similarity

def AWE(data, tokenised_queries, tokenised_passages, model):
    
    sims = []
    
    for i, row in data.iterrows():
        
        qid = row["qid"]
        pid = row["pid"]

        # this if/else is not needed when we use the large set, but for now it is
        if pid not in tokenised_passages.keys():   
            sims.append(0)
            
        else:
            
            # look up the tokenised query/passage (lists of words)
            query = tokenised_queries[qid]
            passage = tokenised_passages[pid]

            # keep only the words that are in the model's vocabulary
            q = [token for token in query if token in model.wv.key_to_index]
            p = [token for token in passage if token in model.wv.key_to_index]
            
            # an average needs at least one known word on each side
            if len(q) >= 1 and len(p) >= 1:
                q_embedding = np.mean(model.wv[q], axis = 0)
                p_embedding = np.mean(model.wv[p], axis = 0)

                sim = cosine_similarity(q_embedding.reshape(1,-1), p_embedding.reshape(1,-1))[0][0]
                sims.append(sim)
            
            else:
                sims.append(0)
        
    return data.assign(AWE = sims)


<div class="alert alert-success", style = "font-family: Helvetica Neue; font-size:110%"> 
The <b>AWE()</b> function returns a dataframe with a row for every <b> query id and passage id combination</b>. The following is an example of the output structure:
</div>


<img src="dataframe.PNG" alt="Example of the AWE output dataframe" style="width: 600px; float: left;"/>

<div style = "font-family: Helvetica Neue;; font-size:190%"> 
<span style = "color:#DB4437"><b>10 XGBRanking </b></span>
</div>

<div class="alert alert-warning">

First, the <b>PairwiseRanker</b> class is defined. Second, we call a function named <b>XGBranker()</b> that fits the PairwiseRanker on the training data and reranks the input dataframe based on its predictions.

</div>
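A learning-to-rank model needs to know which rows belong to the same query. `XGBRanker.fit` takes this as a `group` argument: the size of each query group, in the order in which the rows appear. The sketch below computes such sizes from a column of query ids, assuming that the rows of each query are stored contiguously; `group_sizes` is a hypothetical helper, not part of the notebook.

```python
from itertools import groupby

def group_sizes(qids):
    """Sizes of consecutive runs of identical query ids, in row order."""
    return [len(list(run)) for _, run in groupby(qids)]

# six feature rows: three for q2, two for q1, one for q3
qids = ["q2", "q2", "q2", "q1", "q1", "q3"]
sizes = group_sizes(qids)
```

The sizes must add up to the number of rows and follow the row order, not the sorted order of the query ids. Otherwise the ranker would compare passages that belong to different queries.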

In [52]:
# Feature vec trec file to df function then call xgboost and submit those results
features = []
        
with codecs.open("output/best feature vecs/feature_vector_validation.text", "r", "utf-8") as file:
    for line in file.readlines():
        content = line.split(' ')
        features.append(content[:-1])


validation_df = pd.DataFrame(data=features, columns=['qid', 'pid', 'tfidf', 'QL', 'bm25', 'QTC','DTC', 'tfidf similarity', 'AWE', 'relevance'])

features = []
        
with codecs.open("output/best feature vecs/feature_vector_test.text", "r", "utf-8") as file:
    for line in file.readlines():
        content = line.split(' ')
        features.append(content[:-1])


test_df = pd.DataFrame(data=features, columns=['qid', 'pid', 'tfidf', 'QL', 'bm25', 'QTC','DTC', 'tfidf similarity', 'AWE'])


features = []
        
with codecs.open("output/best feature vecs/feature_vector_training.text", "r", "utf-8") as file:
    for line in file.readlines():
        content = line.split(' ')
        features.append(content[:-1])


training_df = pd.DataFrame(data=features, columns=['qid', 'pid', 'tfidf', 'QL', 'bm25', 'QTC','DTC', 'tfidf similarity', 'AWE' , 'relevance'])


In [77]:
import xgboost as xgb
from xgboost import XGBRanker


class PairwiseRanker:

    FEATURES = ["tfidf", "QL", "bm25", "QTC", "DTC"]

    def __init__(self):
        self.model = None
    
    def fit(self, train_df: pd.DataFrame):
        # the feature files are read as text, so convert the columns to numbers first
        X_train = train_df[self.FEATURES].astype(float).to_numpy()
        y_train = train_df["relevance"].astype(float).to_numpy()
        
        # group sizes in row order; assumes the rows of each query are contiguous
        group = train_df.groupby('qid', sort=False).size().to_numpy()
        self.model = xgb.XGBRanker(random_state=0, objective='rank:ndcg').fit(X_train, y_train, group=group)
    
    def predict(self, test_df: pd.DataFrame):
        X_test = test_df[self.FEATURES].astype(float).to_numpy()
        predicted_relevance = self.model.predict(X_test)
        
        return predicted_relevance
    
reranker = PairwiseRanker()


In [78]:
def XGBranker(df, reranker):
    
    # copy, so that adding a column does not write to a slice of the input dataframe
    rank_df = df[['qid', 'pid']].copy()
    
    reranker.fit(training_df)
    rank_df['reranker'] = reranker.predict(df)
    
    qids = sorted(set(df['qid']))
    
    # per query: sort the passages from highest to lowest predicted relevance
    per_query = [rank_df[rank_df['qid'] == qid].sort_values(by="reranker", ascending=False)
                 for qid in qids]
     
    # DataFrame.append was removed in pandas 2.0, so the parts are concatenated instead
    return pd.concat(per_query) if per_query else rank_df
    
validation_results = XGBranker(validation_df, reranker)
test_results = XGBranker(test_df, reranker)





<div class="alert alert-warning">

The function <b>trec_reranked()</b> writes the reranked DataFrame to a TREC file suitable for submitting to CodaLab.

</div>

In [79]:


def trec_reranked(reranked_df, file_name):
    trec_lines = []
    rank = 1
    last_qid = None
    for i, row in tqdm.notebook.tqdm(reranked_df.iterrows()):
        
        qid = row['qid']
        pid = row['pid']
        
        # start counting from 1 again whenever a new query begins
        if qid != last_qid:
            rank = 1
            last_qid = qid
        
        trec_lines.append(f"{qid} {pid} {rank}\n")
        rank += 1

    with open(f"TREC/{file_name}.text", 'w') as text_file:
        text_file.writelines(trec_lines)


filename = 'xgboost_ranking.zip'
trec_reranked(validation_results, "submission_validation")
trec_reranked(test_results, "submission_test")


validation_file = "TREC/submission_validation.text"
test_file = "TREC/submission_test.text"

with ZipFile("TREC/"+filename, 'w') as zipObj:
    zipObj.write(validation_file, "submission_validation.text")
    zipObj.write(test_file, "submission_test.text")




<div style = "font-family: Helvetica Neue;; font-size:190%"> 
<span style = "color:#F4B400"><b>11 Creating TREC file for RankNet</b></span>
</div>

<div class="alert alert-warning">

The function <b>trec_ranknet()</b> writes the feature vectors of a DataFrame (including the AWE score) to a space-separated text file that serves as input for RankNet.

</div>

In [None]:
def trec_ranknet(df_awe, file_name):
    trec_file = ""

    for i, row in tqdm.notebook.tqdm(df_awe.iterrows()):

        qid = row['qid']
        pid = row['pid']
        relevance = row['relevance']
        tf = row['tfidf']
        ql = row['QL']
        bm = row["bm25"]
        QTC = row['QTC']
        DTC = row['DTC']
        AWE = row['AWE']

        trec_file += f"{qid} {pid} {tf} {ql} {bm} {QTC} {DTC} {AWE} \n"


    text_file = open(f"output/{file_name}.text", 'w')
    text_file.write(trec_file)
    text_file.close()
    
    return 

# trec_ranknet(data_awe_validation, 'feature_vector_validation')

def trec_ranknet_test(df_awe, file_name):
    
    trec_file = ""

    for i, row in tqdm.notebook.tqdm(df_awe.iterrows()):

        qid = row['qid']
        pid = row['pid']
        tf = row['tfidf']
        ql = row['QL']
        bm = row["bm25"]
        QTC = row['QTC']
        DTC = row['DTC']
        AWE = row['AWE']

        trec_file += f"{qid} {pid} {tf} {ql} {bm} {QTC} {DTC} {AWE} \n"

    
    text_file = open(f"output/{file_name}.text", 'w')
    text_file.write(trec_file)
    text_file.close()
    
    return 

<div style = "font-family: Helvetica Neue;; font-size:190%"> 
<span style = "color:#0F9D58"><b>12 RankNet</b></span>
</div>

<div class="alert alert-warning", style = "font-family: Helvetica Neue; font-size:110%"> 
The code below calls the RankNet module and reranks our data using this neural network. 
</div>

In [None]:
# hyperparameters for RankNet
parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=15)
parser.add_argument("--lr", type=float, default=0.001)
parser.add_argument("--input_size", type=int, default=5)
parser.add_argument("--hidden_size1", type=int, default=128)
parser.add_argument("--hidden_size2", type=int, default=128)
parser.add_argument("--output_size", type=int, default=1)
parser.add_argument("--batch_size", type=int, default=512)
parser.add_argument("--random_seed", type=int, default=0)
args = parser.parse_known_args()[0]

<div style = "font-family: Helvetica Neue; font-size:110%"> 
    Also, we need to ensure reproducibility.

In [None]:
np.random.seed(args.random_seed)
torch.manual_seed(args.random_seed)
torch.cuda.manual_seed_all(args.random_seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False  # benchmark mode may pick a different algorithm per run

In [None]:
from ranknet import train, inference

<div style = "font-family: Helvetica Neue; font-size:110%"> 
Next, you need to train RankNet on the training set.

In [None]:
# load the full-ranking result on the training set.
# line format: qid pid tf ql bm QTC DTC AWE
# fields [2]..[6] (tf, ql, bm, QTC, DTC) are used as features

q_id = []
features = []
labels = []
        
print("Load file {}".format("output/best feature vecs/feature_vector_training.text"))
with codecs.open("output/best feature vecs/feature_vector_training.text", "r", "utf-8") as file:
    for line in file.readlines():
        content = line.split(' ')
       
        q_id.append(content[0]) 
        features.append([float(content[2]), float(content[3]), float(content[4]), float(content[5]), float(content[6])])
        labels.append(labels_training[content[0]][content[1]] if content[1] in labels_training[content[0]] else 0)

# train model
%memit train(args, q_id, features, labels)

Then, you need to conduct inference on the validation set.

In [None]:
# load the full-ranking result on the validation set.

print("Load file {}".format("output/feature_vector_validation.text")) 
q_id = []
p_id = []
features = []
        
with codecs.open("output/feature_vector_validation.text", "r", "utf-8") as file:
    for line in file.readlines():
        content = line.split(' ')
        q_id.append(content[0]) 
        features.append([float(content[2]), float(content[3]), float(content[4]), float(content[5]), float(content[6])])
        p_id.append(content[1])

# conduct inference on the validation set.
%memit scores = inference(args, q_id, p_id, features) 

# rank the calculated scores from largest to smallest.
for q_id, p2score in scores.items():
    sorted_p2score=sorted(p2score.items(), key=lambda x:x[1], reverse = True)
    scores[q_id]=sorted_p2score
        
with codecs.open("output/re_ranking_validation_result.text", "w", "utf-8") as file:
    for q_id, p2score in scores.items():
        ranking=0
        for (p_id, score) in p2score:
            ranking+=1           
                    
            file.write('\t'.join([q_id, p_id, str(ranking), str(score), "re_ranking_on_the_validation_set"])+os.linesep)

# output the result file. 
print("Produce file {}".format("re_ranking_validation_result.text")) 

<div style = "font-family: Helvetica Neue; font-size:110%"> 
Similarly, you need to conduct inference on the test set.

In [None]:
# load the full-ranking result on the test set.

print("Load file {}".format("output/feature_vector_test.text")) 
q_id = []
p_id = []
features = []
        
with codecs.open("output/feature_vector_test.text", "r", "utf-8") as file:
    for line in file.readlines():
        content = line.split(' ')
        q_id.append(content[0]) 
        features.append([float(content[2]), float(content[3]), float(content[4]), float(content[5]), float(content[6])])
        p_id.append(content[1])

# conduct inference on the test set.
%memit scores = inference(args, q_id, p_id, features) 

# rank the calculated scores from largest to smallest.
for q_id, p2score in scores.items():
    sorted_p2score=sorted(p2score.items(), key=lambda x:x[1], reverse = True)
    scores[q_id]=sorted_p2score
        
with codecs.open("output/re_ranking_test_result.text", "w", "utf-8") as file:
    for q_id, p2score in scores.items():
        ranking=0
        for (p_id, score) in p2score:
            ranking+=1           
                    
            file.write('\t'.join([q_id, p_id, str(ranking), str(score), "re_ranking_on_the_test_set"])+os.linesep)

# output the result file. 
print("Produce file {}".format("re_ranking_test_result.text")) 

<div class="alert alert-success", style = "font-family: Helvetica Neue; font-size:110%"> 
The overall output of the code above is two TREC files reranked by RankNet.
</div>


<div class="alert alert-info">

<h2><u>The following cells will create the re-ranked lists for the validation and test sets in TREC format.</u></h2>

They have a qid-pid-rank structure
    
</div>
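Before submitting, it can help to check that a result file really has this structure. The sketch below is a hypothetical validator, not part of the provided tooling. It assumes whitespace-separated fields with qid, pid and rank in the first three positions, and it expects the ranks of each query to run 1, 2, 3, ... in file order.

```python
def check_trec_lines(lines):
    """Return a list of problems; an empty list means every query is ranked 1, 2, 3, ..."""
    problems = []
    next_rank = {}
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) < 3:
            problems.append(f"line {number}: expected at least qid, pid and rank")
            continue
        qid, rank = fields[0], fields[2]
        expected = next_rank.get(qid, 1)
        if rank != str(expected):
            problems.append(f"line {number}: {qid} has rank {rank}, expected {expected}")
        next_rank[qid] = expected + 1
    return problems

good = ["q1 p9 1", "q1 p4 2", "q2 p7 1"]
bad = ["q1 p9 1", "q1 p4 3"]
good_problems = check_trec_lines(good)
bad_problems = check_trec_lines(bad)
```

Extra fields after the rank, such as the score and run name written by the RankNet cells, are ignored by this check.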

<div id = "data", style = "font-family: Helvetica Neue;; font-size:190%"> <span style = "color:#4285F4"><b>13 Conclusion</b></span>
</div>

<div style = "font-family: Helvetica Neue; font-size:110%"> 
The submission files have been created. They can be submitted to CodaLab for testing and scoring.

<div style = "font-family: Helvetica Neue;; font-size:190%"> 
<span style = "color:#DB4437"><b>14 Submission to CodaLab Leaderboard</b></span>
</div>

query_id, passage_id , rank

In [None]:
# zip results
studentnumber = "result"
studentname = "DTF"

# Filename of submission Zip Archive to upload to CodaLab
filename = f"{studentnumber}_{studentname}_codalab_submission.zip"

# Filename of submission .text file from validation and test sets
# please make sure these are the rankings you want to submit, and make sure they are in the proper format
# Format: query_id    passage_id    rank     ... other things you write in a line are not taken into account
validation_file = "output/re_ranking_validation_result.text"
test_file = "output/re_ranking_test_result.text"

with ZipFile("output/"+filename, 'w') as zipObj:
    zipObj.write(validation_file, "submission_validation.text")
    zipObj.write(test_file, "submission_test.text")