<a href="https://colab.research.google.com/github/heba14101998/IR-in-Arabic/blob/master/Summer2021/labs/day4/IR_in_Arabic_Lab4_RankedRetrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# **IR in Arabic** - Summer 2021 lab day4


This is one of a series of Colab notebooks created for the **IR in Arabic** course. It demonstrates how to perform ranked retrieval and evaluate the output.

The **learning outcomes** of this notebook are:


*   Retrieval using a vector space model and a BM25 model.
*   Evaluating the results.


### **Setup**
We will first install PyTerrier as follows:

In [None]:
#install the PyTerrier framework
!pip install python-terrier -q

The next step is to initialize PyTerrier. This is performed using PyTerrier's init() method. The init() method is needed as PyTerrier must download Terrier's jar file and start the Java virtual machine. We prevent init() from being called more than once by checking started().

In [None]:
import pyterrier as pt
if not pt.started():
  pt.init()

Another library that we need for this lab is Arabic-Stopwords.

In [None]:
#install the Arabic stop words library
!pip install Arabic-Stopwords

We will import all the Python libraries needed for this lab.

In [None]:
#we need to import the following libraries.
import pandas as pd
#to display the full text on the notebook without truncation
pd.set_option('display.max_colwidth', 150)
import numpy as np
import re
from snowballstemmer import stemmer
from tqdm import tqdm
import arabicstopwords.arabicstopwords as stp

We will prepare our helper functions for removing stop words, normalizing, and stemming, which we will use to process our queries.

In [None]:
#removing Stop Words function
#build the stop word set once instead of rebuilding it on every call
stopWords = set(stp.stopwords_list())
def remove_stopWords(sentence):
    terms=[]
    for term in sentence.split():
        if term not in stopWords:
            terms.append(term)
    return " ".join(terms)

#a function to normalize the tweets
def normalize(text):
    text = re.sub("[إأٱآا]", "ا", text)
    text = re.sub("ى", "ي", text)
    text = re.sub("ؤ", "ء", text)
    text = re.sub("ئ", "ء", text)
    text = re.sub("ة", "ه", text)
    return(text)

#define the stemming function
ar_stemmer = stemmer("arabic")
def stem(sentence):
    return " ".join([ar_stemmer.stemWord(i) for i in sentence.split()])
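
As a quick sanity check, the sketch below repeats the `normalize` rules from the cell above so that it runs on its own, and applies them to three sample words. The sample words are illustrative choices, not taken from EveTAR.

```python
import re

# Same substitution rules as the notebook's normalize(), repeated so this runs standalone.
def normalize(text):
    text = re.sub("[إأٱآا]", "ا", text)  # unify alef variants
    text = re.sub("ى", "ي", text)        # alef maqsura -> yaa
    text = re.sub("ؤ", "ء", text)        # hamza on waw -> bare hamza
    text = re.sub("ئ", "ء", text)        # hamza on yaa -> bare hamza
    text = re.sub("ة", "ه", text)        # taa marbuta -> haa
    return text

# Illustrative sample words (assumed, not from the dataset).
toy_samples = ["أحمد", "مدرسة", "مستشفى"]
toy_normalized = [normalize(word) for word in toy_samples]
for raw, norm in zip(toy_samples, toy_normalized):
    print(raw, "->", norm)
```

Each word changes in exactly one character, which is enough to make the raw and normalized forms different index terms.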

We will use our indexed **EveTAR** dataset. The index is uploaded in our Github repository so we will access it as follows:

In [None]:
# %rm -rf IR-in-Arabic
# %rm -rf evetarIndex
!git clone https://github.com/telsayed/IR-in-Arabic.git
!unzip IR-in-Arabic/Summer2021/data/EveTAR/evetarIndex.zip -d evetarIndex
!ls evetarIndex

Next, we will load our index. We just need the data.properties file to load it.

In [None]:
#we will load the index
index_ref = pt.IndexRef.of("./evetarIndex/data.properties")
index = pt.IndexFactory.of(index_ref)

### **Vector space model retrieval**
We will use PyTerrier's BatchRetrieve class for retrieval, with TF-IDF as the weighting model. You can check the weighting models supported by PyTerrier [here](http://terrier.org/docs/current/javadoc/org/terrier/matching/models/package-summary.html).

In [None]:
#set up our retrieval model by specifying TF_IDF as wmodel and limiting the results for each query to the top 10 documents
tfidf_retr = pt.BatchRetrieve(index, controls = {"wmodel": "TF_IDF"},num_results=10)
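
To give some intuition for what a TF-IDF weighting model does, here is a minimal sketch on a made-up, already-tokenized collection. It uses the textbook form `tf * log(N / df)`, which is an assumption for illustration and is not claimed to be the exact formula behind Terrier's `TF_IDF` model. All `toy_` names are hypothetical.

```python
import math
from collections import Counter

# Toy collection of already-tokenized documents (made up for illustration).
toy_docs = {
    "d1": ["oil", "price", "rise"],
    "d2": ["oil", "oil", "export"],
    "d3": ["football", "match"],
}

n_docs = len(toy_docs)
# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in toy_docs.values() for term in set(tokens))

def tfidf_score(query_terms, tokens):
    """Sum of tf * log(N / df) over the query terms present in the document."""
    tf = Counter(tokens)
    return sum(tf[t] * math.log(n_docs / doc_freq[t]) for t in query_terms if t in tf)

toy_query = ["oil", "export"]
toy_scores = {docno: tfidf_score(toy_query, tokens) for docno, tokens in toy_docs.items()}
# Rank documents by descending score, as a retrieval model would.
toy_ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)
print(toy_ranking)
```

The rare term `export` contributes more than the common term `oil`, and a document that matches no query term scores zero.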

You can query using a simple string. **Note:** you need to preprocess the query using the same processing steps you performed before indexing.

In [None]:
#we need to process the query also as we did for documents
def preprocess(sentence):
  # apply preprocessing steps on the given sentence
  # students ToDo .....
  # here you should write your code
  return sentence

query="العربية"
query = preprocess(query)
#we will call the search function using our retrieval model we set up above
results=tfidf_retr.search(query)
results
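
The sketch below illustrates why the query must go through the same processing as the documents. It uses a hypothetical one-rule normalizer (`normalize_taa`) and a made-up two-document collection in place of the real index, so it shows the matching problem only, not the full pipeline.

```python
import re

# Hypothetical one-rule normalizer (taa marbuta -> haa), standing in for the
# full pipeline that was applied to the tweets before indexing.
def normalize_taa(text):
    return re.sub("ة", "ه", text)

# Toy "index": the set of terms seen after processing two made-up documents.
toy_docs = ["اللغة العربية", "القمة العربية"]
toy_vocabulary = {term for doc in toy_docs for term in normalize_taa(doc).split()}

raw_query = "العربية"
processed_query = normalize_taa(raw_query)

print(raw_query in toy_vocabulary)        # the raw surface form is not an index term
print(processed_query in toy_vocabulary)  # the processed form matches
```

The raw query term never appears in the processed vocabulary, so an unprocessed query can miss documents that plainly contain the word.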

Let's check the tweets retrieved by getting the tweets text from our collection.

In [None]:
dataset_links=["https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-01.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-02.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-03.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-04.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-05.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-06.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-07.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-08.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-09.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-10.txt"]

full_data=pd.DataFrame()
for i in tqdm(range(len(dataset_links))):
    tweets=pd.read_csv(dataset_links[i], sep='\t')
    full_data=pd.concat([full_data,tweets],ignore_index=True)
full_data.reset_index(inplace=True,drop=True)
#the docno will be our tweetID
full_data["docno"]=full_data["tweetID"].astype(str)
#select tweet text for the tweets retrieved only
full_data[full_data["docno"].isin(results["docno"].tolist())]
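
Note that filtering with `isin` returns rows in collection order, not in rank order. If the ranking should be preserved, one option is to merge from the results side, sketched below on made-up stand-ins for `results` and `full_data`. The `text` column name and all values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for `results` and `full_data` (values are made up).
toy_results = pd.DataFrame({
    "docno": ["30", "10", "20"],
    "rank": [0, 1, 2],
    "score": [5.1, 4.2, 3.3],
})
toy_data = pd.DataFrame({
    "docno": ["10", "20", "30", "40"],
    "text": ["tweet A", "tweet B", "tweet C", "tweet D"],  # hypothetical column name
})

# A left merge keeps the row order of the left frame, i.e. the ranking.
toy_ranked = toy_results.merge(toy_data[["docno", "text"]], on="docno", how="left")
print(toy_ranked)
```

The merged frame lists the tweets from best to worst score, with the rank and score next to the text.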


Let's try another query.

In [None]:
#we need to process the term also as we did for documents
query="أمريكا دولار"
#preprocess
query = preprocess(query)
#we will call the search function using our retrieval model we set up above
results=tfidf_retr.search(query)
results

Let's check our results

In [None]:
# retrieve the tweets text for the retrieved tweets just to check our results
full_data[full_data["docno"].isin(results["docno"].tolist())]

We will load the queries (topic titles) that are defined and released with the EveTAR dataset, and process them using the same steps we applied when we indexed EveTAR.

In [None]:
#read the topics file from Github and use the titles as queries
topics=pd.read_csv("https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/topics.txt", sep='\t',names=['data'])
queries=[]
qid=[]
#we will get the queries and their ids from the topics file
for i in range(len(topics)):
    splitted=topics["data"].iloc[i].split(" ")
    if splitted[0]=="<title>":
       queries.append(' '.join(splitted[1:]))
    if splitted[0]=="<num>":
       qid.append(splitted[2])

queriesDF=pd.DataFrame()
queriesDF["qid"]=qid
queriesDF["raw_query"]=queries
#remove the stopwords from queries, do normalization, and apply stemming
queriesDF["query"]=queriesDF["raw_query"].apply(preprocess)

queriesDF

We can retrieve the documents relevant to a set of queries using the **transform** function. We will use the set of queries we prepared earlier. The input should be a dataframe containing the **qid** and **query** columns. The function returns one row per retrieved document, keeping the query columns and adding four more: **docid**, **docno**, **rank**, and **score**.

In [None]:
#the queries dataframe should have qid and query columns
tfidf_res=tfidf_retr.transform(queriesDF)
tfidf_res

### **BM25**

We will initialize our BM25 retrieval model by using the BatchRetrieve class and setting the weighting model to BM25.

In [None]:
#specify BM25 as wmodel
bm25_retr = pt.BatchRetrieve(index, controls = {"wmodel": "BM25"},num_results=10)
#the queries dataframe should have qid and query columns
bm25_res=bm25_retr.transform(queriesDF)
bm25_res
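
For intuition on how BM25 differs from plain TF-IDF, here is a minimal sketch of one common BM25 variant on a made-up collection. The parameter values `k1 = 1.2` and `b = 0.75` and the smoothed IDF form are assumptions chosen for illustration; the sketch is not claimed to reproduce Terrier's scores.

```python
import math
from collections import Counter

# Toy tokenized collection (made up); d2 is long and repeats "oil" three times.
toy_docs = {
    "d1": ["oil", "price", "rise"],
    "d2": ["oil", "oil", "oil", "export", "market", "report", "today", "news"],
    "d3": ["football", "match"],
}
k1, b = 1.2, 0.75  # commonly used values, assumed here

n_docs = len(toy_docs)
avg_len = sum(len(tokens) for tokens in toy_docs.values()) / n_docs
doc_freq = Counter(term for tokens in toy_docs.values() for term in set(tokens))

def bm25_score(query_terms, tokens):
    """BM25 with term-frequency saturation (k1) and length normalization (b)."""
    tf = Counter(tokens)
    score = 0.0
    for t in query_terms:
        if t not in tf:
            continue
        idf = math.log(1 + (n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score += idf * tf[t] * (k1 + 1) / (tf[t] + length_norm)
    return score

toy_bm25 = {docno: bm25_score(["oil"], tokens) for docno, tokens in toy_docs.items()}
print(toy_bm25)
```

Although `d2` contains the query term three times as often as `d1`, its score is far less than three times higher, because repeated occurrences saturate and long documents are penalized.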

### **Evaluating our results**
To evaluate the results we need qrels (relevance judgements). The qrels should be in [TREC format](https://trec.nist.gov/).

In [None]:
qrels=pd.read_csv("https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/qrels.txt", sep='\t',names=['qid','Q0','docno','label'])
qrels['docno']=qrels['docno'].astype(str)
# qrels are in TREC format
qrels = qrels[qrels["docno"].isin(full_data["docno"].tolist())] # to choose qrels for the chosen 50k documents
qrels
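
To make the four-column layout concrete, the sketch below parses a few made-up qrels lines (query id, an unused iteration field, document id, relevance label) into a nested dictionary and counts the relevant documents per query. The ids and labels are invented for illustration.

```python
from collections import defaultdict

# Made-up qrels lines: qid, iteration (unused), docno, relevance label.
toy_qrels_text = "E01\t0\t101\t1\nE01\t0\t102\t0\nE01\t0\t103\t1\nE02\t0\t201\t1\n"

# Map each query id to its judged documents and their labels.
qrels_by_query = defaultdict(dict)
for line in toy_qrels_text.splitlines():
    qid, _iteration, docno, label = line.split("\t")
    qrels_by_query[qid][docno] = int(label)

# The number of relevant documents per query is the denominator of recall.
relevant_counts = {
    qid: sum(1 for lab in judged.values() if lab > 0)
    for qid, judged in qrels_by_query.items()
}
print(relevant_counts)
```

A label of 0 marks a document that was judged and found non-relevant, which is different from a document that was never judged.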

Let's see an example of a dataset with graded relevance judgements.

In [None]:
#check the following dataset available by PyTerrier. The relevance is graded
pt.get_dataset("trec-robust-2004").get_qrels()

In [None]:
#check the unique labels
pt.get_dataset("trec-robust-2004").get_qrels()['label'].unique()

To evaluate our results we will use PyTerrier's Utils.evaluate function. This function takes the results and a qrels dataframe containing three columns: **qid, docno, label.**

You can add the following parameters:


*   **metrics**: default = ["map", "ndcg"]; selects the evaluation metrics to compute.

*   **perquery**: default = False; selects whether to report the mean of each metric or its value for each query.





In [None]:
# Here, we are evaluating TF_IDF retrieval model
eval = pt.Utils.evaluate(tfidf_res,qrels[['qid','docno','label']],metrics=["map","recall","P"])
eval

In [None]:
# Here, we are evaluating BM25 retrieval model
eval = pt.Utils.evaluate(bm25_res,qrels[['qid','docno','label']],metrics=["map","recall","P"])
eval

In [None]:
# Here, we are evaluating the BM25 retrieval model BUT with the perquery flag activated
eval = pt.Utils.evaluate(bm25_res,qrels[['qid','docno','label']],metrics=["map","recall","P"],perquery=True)
eval
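
The metrics reported above can also be computed by hand for a single query. The sketch below does so for a made-up ranking and a made-up set of relevant documents; the helper names are hypothetical and binary relevance is assumed. MAP is the mean of the average precision over all queries.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for d in ranking[:k] if d in relevant) / len(relevant)

def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for position, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

# Made-up ranking and relevance set for one query.
toy_ranking = ["d3", "d1", "d7", "d2", "d9"]
toy_relevant = {"d1", "d2", "d5"}

p5 = precision_at_k(toy_ranking, toy_relevant, 5)
r5 = recall_at_k(toy_ranking, toy_relevant, 5)
ap = average_precision(toy_ranking, toy_relevant)
print(p5, r5, ap)
```

Here two of the five retrieved documents are relevant, two of the three relevant documents are found, and the relevant document that is never retrieved (`d5`) lowers the average precision.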

### **Exercise 1**
Select three queries from our 50 queries and retrieve the top 25 relevant tweets for those queries using both the TF-IDF and the BM25 retrieval models. Evaluate your results in terms of precision only.

In [None]:
# write your solution here

### **Exercise 2**
Given the following queries:

['E14' 'E48' 'E36' 'E58' 'E19' 'E63' 'E30' 'E27' 'E39' 'E21']

1. Retrieve the top 1000 relevant documents using BM25.
2. Retrieve the text for both queries and documents and combine them into one dataframe.
3. Save the resulting dataframe to a text file.


In [None]:
selected_queries = ['E14','E48', 'E36', 'E58', 'E19', 'E63', 'E30', 'E27', 'E39', 'E21']
# write your solution here

### **References**


*   [PyTerrier retrieval and evaluation notebook](https://github.com/terrier-org/pyterrier/blob/master/examples/notebooks/retrieval_and_evaluation.ipynb)
*   [PyTerrier documentation](https://pyterrier.readthedocs.io/_/downloads/en/latest/pdf/)

