# WordNet-based Query Expansion
This notebook implements a WordNet-based query expansion method and compares it against a control condition (no expansion) and RM3 query expansion. It lets you evaluate each condition on multiple data sets. The text blocks explain what each part of the notebook does and how to use it.

## Imports & Downloads
For this project you need Pyserini version 0.9.4.0 to reproduce the results in the report, so the following code block installs exactly that version. You also need a working Java 11 installation, because Pyserini is built on Anserini, which is Java-based. To simplify setup, we recommend running this notebook in Google Colab.

In [1]:
%%capture
!pip install pyserini==0.9.4.0

import os
# Set JAVA_HOME before importing pyserini so that its Java bridge can locate the JDK
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

import nltk
import sys
import time
import pandas as pd
from pyserini.search import get_topics
from pyserini.search import SimpleSearcher
from pyserini.search import querybuilder
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from IPython.display import clear_output

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

## Data Sets
The following three code blocks download the data sets as prebuilt indexes. To verify that these uploads have not changed, use the fourth code block to check that the extracted indexes have the sizes listed there.

In [None]:
%%capture
# Get Robust04 Dataset ~2 min
!wget https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-robust04-20191213.tar.gz
# Backup URL: https://www.dropbox.com/s/s91388puqbxh176/index-robust04-20191213.tar.gz
!tar xvfz index-robust04-20191213.tar.gz

In [None]:
%%capture
# Get MsMarcoPassage Dataset ~2 min
!wget https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-msmarco-passage-20191117-0ed488.tar.gz
!tar xvfz index-msmarco-passage-20191117-0ed488.tar.gz

In [None]:
%%capture
# Get MsMarcoDocument Dataset ~20 min
!wget https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-msmarco-doc-20201117-f87c94.tar.gz
!tar xvfz index-msmarco-doc-20201117-f87c94.tar.gz

In [None]:
# Sanity check of Robust04: 2.1G
!du -h index-robust04-20191213

# Sanity check of MsMarcoPassage: 2.5G
!du -h index-msmarco-passage-20191117-0ed488

# Sanity check of MsMarcoDocument: 16G
!du -h index-msmarco-doc-20201117-f87c94

2.1G	index-robust04-20191213
2.5G	index-msmarco-passage-20191117-0ed488
16G	index-msmarco-doc-20201117-f87c94
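
If `du` is not available (for example outside Colab), the same sanity check can be approximated in Python. The sketch below is only an illustration: the helper names `dir_size_bytes` and `human_readable` are our own, and the reported values are summed file sizes, which may differ slightly from the disk usage that `du -h` prints.

```python
import os

def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symbolic links so that nothing is counted twice
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total

def human_readable(num_bytes):
    """Format a byte count with binary units, roughly like `du -h`."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G", "T"):
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024

# Report the size of each index directory that is present locally
for index_dir in ("index-robust04-20191213",
                  "index-msmarco-passage-20191117-0ed488",
                  "index-msmarco-doc-20201117-f87c94"):
    if os.path.isdir(index_dir):
        print(human_readable(dir_size_bytes(index_dir)), index_dir, sep="\t")
```

Directories that have not been downloaded are skipped silently, so the cell can be run at any point.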


## Helper Functions
To make the results more readable, we temporarily store them in a .txt file. At the end, this makes it easier to convert them to a dataframe, and the .txt file can also be downloaded for easy data export.

In [2]:
# Write to .txt file to store analysis
def setStdOutToFile():
  old_stdout = sys.stdout
  writer = open('stdout.txt', 'a')
  sys.stdout = writer
  return writer, old_stdout

# Close writer and reset stdout
def resetStdOut(writer, old_stdout):
  writer.close()
  sys.stdout = old_stdout

# Clear the output of a code block for nicer notebook
def clearOutput():
  clear_output(wait=True)
  print("", flush=True)
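
As an alternative to pairing `setStdOutToFile` with `resetStdOut` by hand, the same redirection can be wrapped in a context manager, so that `sys.stdout` is restored even if the code in between raises an exception. This is a minimal sketch, not what the notebook uses; `append_stdout_to` is a hypothetical name. Like the helpers above, it only captures output that goes through Python's `sys.stdout`.

```python
import contextlib
import os
import tempfile

@contextlib.contextmanager
def append_stdout_to(path):
    """Temporarily send everything printed via sys.stdout to `path` (append mode)."""
    with open(path, "a") as writer, contextlib.redirect_stdout(writer):
        yield writer

# Demo on a temporary file so that the real stdout.txt is left untouched
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "stdout_demo.txt")
    with append_stdout_to(demo_path):
        print("Robust04")            # written to the file, not to the notebook
    with open(demo_path) as fh:
        captured = fh.read()

print(repr(captured))                # stdout is back to normal here
```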

## Query Expansion
The following code blocks implement the novel WordNet-based query expansion. The notebook also lets you run the other conditions from the project, namely the control condition (no query expansion) and RM3 query expansion. In the following code block, set the integer variable `QEMethod` to choose the condition:
1. Control Condition
2. WordNet-based Expansion
3. RM3 Expansion

In [None]:
# QEMethod = 1 # Control Condition
QEMethod = 2 # WordNet-based Expansion
# QEMethod = 3 # RM3 Expansion
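
Condition 3 relies on Pyserini's built-in RM3, which the evaluation cells enable with `searcher.set_rm3(10, 10, 0.5)`. For readers unfamiliar with RM3, the sketch below illustrates only its core idea: the original query model is interpolated with a feedback model built from top-ranked documents. It is a simplified illustration, not Anserini's implementation. The function name and the toy weights are made up, and we assume that the three arguments of `set_rm3` correspond to the number of feedback terms, the number of feedback documents, and the original query weight.

```python
from collections import Counter

def rm3_interpolate(query_terms, feedback_weights, fb_terms=10,
                    original_query_weight=0.5):
    """Mix the original query model with a pruned, normalised feedback model."""
    # Original query model: relative frequency of each query term
    counts = Counter(query_terms)
    total = sum(counts.values())
    original = {term: count / total for term, count in counts.items()}

    # Feedback model: keep the fb_terms heaviest terms and renormalise
    top = sorted(feedback_weights.items(), key=lambda kv: kv[1],
                 reverse=True)[:fb_terms]
    norm = sum(weight for _, weight in top)
    feedback = {term: weight / norm for term, weight in top} if norm > 0 else {}

    # Linear interpolation of the two models
    lam = original_query_weight
    return {term: lam * original.get(term, 0.0)
                  + (1 - lam) * feedback.get(term, 0.0)
            for term in set(original) | set(feedback)}

# Toy example: in real RM3 the feedback weights are estimated from the
# top-ranked documents of a first retrieval pass; here they are invented
weights = rm3_interpolate(["black", "bear"],
                          {"bear": 0.6, "attack": 0.3, "park": 0.1},
                          fb_terms=2, original_query_weight=0.5)
print(sorted(weights.items()))
```

With `fb_terms=2` the lowest-weighted feedback term is pruned before interpolation, and the resulting term weights still sum to 1.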

In [None]:
def build_query(query, limit=0, pos=True):
    """Expand the query.
        Parameters
        ----------
        query : str
            Query string.
        limit : int
            Determines the maximum amount of word expansions per query term, 
            not restricted if limit=0.
        pos: bool
            Determines whether or not to apply part of speech tagging 
            to the query expansion algorithm.
        Returns
        -------
        str
            Expanded query
    """
    words = query.split()
    tagged_query = nltk.pos_tag(words, tagset='universal')
    stop_words = set(stopwords.words('english'))
    # Drop stop words but keep the POS tag of every remaining query term
    filtered_tagged_words = [w for w in tagged_query if w[0] not in stop_words]
    expanded_words = set()

    for word in filtered_tagged_words:
      expanded_words.add(word[0])
      starting_length = len(expanded_words)
      for syn in wordnet.synsets(word[0]):
        for l in syn.lemmas():
          synonym = l.name()
          if synonym.lower() in stop_words:
            continue
          # Skip further synonyms once `limit` expansions were added for this term
          if limit != 0 and len(expanded_words) >= starting_length + limit:
            continue
          if pos:
            # Keep only synonyms whose POS tag matches that of the query term
            tagged_synonym = nltk.pos_tag(nltk.word_tokenize(synonym), tagset='universal')
            if word[1] != tagged_synonym[0][1]:
              continue
          expanded_words.add(synonym)

    # Join the original terms and their expansions into one query string
    return " ".join(expanded_words)
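
To make the interaction between `limit` and the part-of-speech filter easier to follow without downloading the NLTK data, here is a minimal sketch of the same selection logic on a hand-written synonym table. Everything in it is hypothetical: `expand_terms`, the toy synonym dictionary, and the tags stand in for the WordNet lookups and the tagger used by `build_query`. Unlike `build_query`, it keeps the terms in an ordered list so that the output is deterministic.

```python
def expand_terms(tagged_terms, synonyms, limit=0):
    """Expand each (word, pos) pair with up to `limit` same-POS synonyms.

    `synonyms` maps a word to a list of (candidate, pos) pairs;
    limit=0 means no restriction.
    """
    expanded = []
    seen = set()
    for word, pos in tagged_terms:
        if word not in seen:
            seen.add(word)
            expanded.append(word)
        added = 0
        for candidate, candidate_pos in synonyms.get(word, []):
            if limit and added >= limit:
                break  # this term already received its maximum of expansions
            if candidate_pos == pos and candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
                added += 1
    return " ".join(expanded)

# Toy data (invented for illustration, not taken from WordNet)
toy_query = [("car", "NOUN"), ("run", "VERB")]
toy_synonyms = {
    "car": [("auto", "NOUN"), ("automobile", "NOUN"), ("machine", "NOUN")],
    "run": [("run", "NOUN"), ("sprint", "VERB"), ("streak", "NOUN"),
            ("operate", "VERB")],
}

print(expand_terms(toy_query, toy_synonyms, limit=2))
# car auto automobile run sprint operate
```

With `limit=2`, "machine" is dropped because "car" already has two expansions, and the noun senses of "run" are filtered out because the query term is tagged as a verb.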

In [None]:
def run_all_queries(file, topics, searcher):
    with open(file, 'w') as runfile:
        cnt = 0
        print('Running {} queries in total'.format(len(topics)))
        for id in topics:        
            query = topics[id]['title']

            if QEMethod == 2:
              # WordNet expansion: rewrite the query before searching
              query = build_query(query, limit=2, pos=True)

            # Control and RM3 search with the original query; RM3 is enabled
            # on the searcher itself (set_rm3) before this function is called
            hits = searcher.search(query, 1000)

            for i in range(0, len(hits)):
                _ = runfile.write('{} Q0 {} {} {:.6f} Anserini\n'.format(id, hits[i].docid, i+1, hits[i].score))
            cnt += 1
            if cnt % 100 == 0:
                print(f'{cnt} queries completed')
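
Each line that `run_all_queries` writes follows the six-column TREC run format: topic id, the literal `Q0`, document id, rank, score, and a run tag. The sketch below shows how such a line is built and read back. The helper names and the document id `DOC-001` are made up for illustration.

```python
def format_run_line(topic_id, docid, rank, score, tag="Anserini"):
    """Build one TREC run line in the same layout as run_all_queries."""
    return f"{topic_id} Q0 {docid} {rank} {score:.6f} {tag}"

def parse_run_line(line):
    """Split a TREC run line back into its six fields."""
    topic_id, _q0, docid, rank, score, tag = line.split()
    return {"topic": topic_id, "docid": docid, "rank": int(rank),
            "score": float(score), "tag": tag}

line = format_run_line(301, "DOC-001", 1, 12.3456789)
print(line)                  # 301 Q0 DOC-001 1 12.345679 Anserini
print(parse_run_line(line))
```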

## Evaluation
The following code blocks run the selected query expansion method and evaluate its performance using the standard trec_eval evaluation tools available with Pyserini/Anserini. For each data set we give an upper-bound estimate of the runtime, based on our runs in Google Colab. Runtime depends heavily on the chosen condition and parameters, and the estimates are the average runtime of the slowest combination, so most runs will be noticeably faster. Runtime also depends on your CPU, so treat these estimates as rough guidance.
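
The results table at the end of the notebook reports, among others, MAP, reciprocal rank, and P@5. As a reminder of what these measures mean, the sketch below computes them for a single topic with binary relevance judgments. It is a simplified illustration rather than a replacement for trec_eval, and the document ids are invented. MAP is the mean of the per-topic average precision over all topics.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero
    return total / len(relevant)

ranked = ["d3", "d1", "d7", "d2", "d9"]   # system ranking for one topic
relevant = {"d1", "d2", "d5"}             # judged relevant documents

print(precision_at_k(ranked, relevant, 5))   # 0.4
print(reciprocal_rank(ranked, relevant))     # 0.5
print(average_precision(ranked, relevant))   # (1/2 + 2/4) / 3 = 0.333...
```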

### Robust04: < 1 min

In [None]:
start = time.perf_counter()
searcher = SimpleSearcher('index-robust04-20191213')
if (QEMethod == 3):
  searcher.set_rm3(10, 10, 0.5)
topics = get_topics('robust04')
run_all_queries('run-robust04-bm25.txt', topics, searcher)
!wget -O jtreceval-0.0.5-jar-with-dependencies.jar https://search.maven.org/remotecontent?filepath=uk/ac/gla/dcs/terrierteam/jtreceval/0.0.5/jtreceval-0.0.5-jar-with-dependencies.jar
!wget https://raw.githubusercontent.com/castorini/anserini/master/src/main/resources/topics-and-qrels/qrels.robust04.txt
writer, old_stdout = setStdOutToFile()
print("Robust04")
print("time                  \tall\t", round(time.perf_counter()-start)) #Timer in seconds
!java -jar jtreceval-0.0.5-jar-with-dependencies.jar qrels.robust04.txt run-robust04-bm25.txt
resetStdOut(writer, old_stdout)
clearOutput()




### MsMarcoPassage: < 35 min

In [None]:
start = time.perf_counter()
searcher = SimpleSearcher('index-msmarco-passage-20191117-0ed488')
if (QEMethod == 3):
  searcher.set_rm3(10, 10, 0.5)
topics = get_topics('msmarco_passage_dev_subset')
run_all_queries('run-msmarco-passage-bm25.txt', topics, searcher)
!wget -O jtreceval-0.0.5-jar-with-dependencies.jar https://search.maven.org/remotecontent?filepath=uk/ac/gla/dcs/terrierteam/jtreceval/0.0.5/jtreceval-0.0.5-jar-with-dependencies.jar
!wget https://raw.githubusercontent.com/castorini/anserini/master/src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt
writer, old_stdout = setStdOutToFile()
print("MsMarcoPassage")
print("time                  \tall\t", round(time.perf_counter()-start)) #Timer in seconds
!java -jar jtreceval-0.0.5-jar-with-dependencies.jar qrels.msmarco-passage.dev-subset.txt run-msmarco-passage-bm25.txt
resetStdOut(writer, old_stdout)
clearOutput()




### MsMarcoDocument: < 5 hours

In [None]:
##### MsMarcoDoc ##### ~120 min 
start = time.perf_counter()
searcher = SimpleSearcher('index-msmarco-doc-20201117-f87c94')
if (QEMethod == 3):
  searcher.set_rm3(10, 10, 0.5)
topics = get_topics('msmarco_doc_dev')
run_all_queries('run-msmarco-doc-bm25.txt', topics, searcher)
!wget -O jtreceval-0.0.5-jar-with-dependencies.jar https://search.maven.org/remotecontent?filepath=uk/ac/gla/dcs/terrierteam/jtreceval/0.0.5/jtreceval-0.0.5-jar-with-dependencies.jar
!wget https://raw.githubusercontent.com/castorini/anserini/master/src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt
writer, old_stdout = setStdOutToFile()
print("MsMarcoDoc")
print("time                  \tall\t", round(time.perf_counter()-start)) #Timer in seconds
!java -jar jtreceval-0.0.5-jar-with-dependencies.jar qrels.msmarco-doc.dev.txt run-msmarco-doc-bm25.txt
resetStdOut(writer, old_stdout)
clearOutput()

## Data Visualization and Export
The following code converts the evaluation output into a dataframe and displays it. The results can also be downloaded, since "stdout.txt" contains all of them.

In [4]:
# Convert .txt to table and print it 
resetStdOut(writer, old_stdout)
with open('stdout.txt', 'r') as stdout_file:
  lines = stdout_file.readlines()

count = 0
datasets = ["Robust04", "MsMarcoPassage", "MsMarcoDoc"]
column_names = ["Dataset", "MAP", "Recip Rank", "P@5", "Num Rel", "Num Rel Ret" , "Time"]
# Line offsets within one result block (line 1 is the dataset name):
# time, num_rel, num_rel_ret, map, recip_rank, P_5
keep_lines = [2, 6, 7, 8, 12, 24]
max_line = 32
rows = []

metric_list = []
dataset_name = ""

# Strips the .txt and collects one row per dataset
for line in lines: 
  if line.strip() in datasets:
    count = 0
    metric_list = []
    dataset_name = line.strip()
  count += 1
  if count in keep_lines:
    metric_list.append(float(line.strip().split()[2]))
  if count == max_line:
    rows.append({'Dataset': dataset_name,
                 'MAP': metric_list[3],
                 'P@5': metric_list[5],
                 'Recip Rank': metric_list[4],
                 'Num Rel': metric_list[1],
                 'Num Rel Ret': metric_list[2],
                 'Time': metric_list[0]}) # seconds

# Build the dataframe in one step; DataFrame.append was removed in pandas 2.0
df = pd.DataFrame(rows, columns=column_names)
df

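|   | Dataset | MAP | Recip Rank | P@5 | Num Rel | Num Rel Ret | Time |
|---|---------|-----|------------|-----|---------|-------------|------|
| 0 | Robust04 | 0.2047 | 0.5766 | 0.4008 | 17412.0 | 9017.0 | 53.0 |
| 1 | MsMarcoPassage | 0.155 | 0.1575 | 0.0473 | 7436.0 | 6064.0 | 789.0 |
| 2 | MsMarcoDoc | 0.1388 | 0.1388 | 0.0415 | 5193.0 | 4154.0 | 13296.0 |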
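
The conversion code above selects metrics by their line position in "stdout.txt", which breaks if the evaluation tool prints its measures in a different order. A possibly more robust variant is to select them by name. The sketch below assumes the usual three-column trec_eval layout (measure, topic, value); `parse_trec_eval` is a hypothetical helper, and the sample text is a shortened excerpt built from the Robust04 row of the results table.

```python
def parse_trec_eval(text, wanted=("map", "recip_rank", "P_5",
                                  "num_rel", "num_rel_ret")):
    """Extract the summary ("all") values of selected measures by name."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split()
        # Summary lines have the form: <measure> all <value>
        if len(parts) == 3 and parts[1] == "all" and parts[0] in wanted:
            metrics[parts[0]] = float(parts[2])
    return metrics

# Shortened sample in the assumed trec_eval layout
sample = """runid                 \tall\tAnserini
num_rel               \tall\t17412
num_rel_ret           \tall\t9017
map                   \tall\t0.2047
recip_rank            \tall\t0.5766
P_5                   \tall\t0.4008
"""

print(parse_trec_eval(sample))
```

Lines whose measure is not in `wanted`, such as `runid`, are ignored, so the order and number of measures in the file no longer matter.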
