# IR Lab SoSe 2024: Lemmatization Baseline

This Jupyter notebook serves as a baseline example for using BM25 with lemmatization instead of the Porter stemmer.

### Step 1: Import Libraries

We use [TIRA](https://www.tira.io/), an information retrieval shared task platform, to load the pre-built retrieval indexes, and [ir_datasets](https://ir-datasets.com/) to access the dataset. On top of these, we build two BM25 retrieval systems with [PyTerrier](https://github.com/terrier-org/pyterrier): one that uses the default Porter stemmer and one that uses the Stanford lemmatizer.

In [2]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client
import pyterrier as pt
from pathlib import Path
import os

# do not truncate text in the dataframe
import pandas as pd
pd.set_option('display.max_colwidth', None)

In [3]:
# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
tira = Client()
stanford_tokenizer_jar = tira.load_resource("custom-terrier-token-processing-1.0-SNAPSHOT-jar-with-dependencies.jar")

In [4]:
# Ensure that the Stanford tokenizer is available on the PyTerrier classpath

pyterrier_resource_dir = Path.home() / '.pyterrier'
os.makedirs(pyterrier_resource_dir, exist_ok=True)
jar_link = pyterrier_resource_dir / 'custom-terrier-token-processing-0.0.1.jar'

# Remove a stale link from a previous run so that os.symlink does not fail
try:
    os.remove(jar_link)
except FileNotFoundError:
    pass

os.symlink(stanford_tokenizer_jar, jar_link)

In [5]:
ensure_pyterrier_is_loaded(boot_packages=("com.github.terrierteam:terrier-prf:-SNAPSHOT", "mam10eks:custom-terrier-token-processing:0.0.1"))
from jnius import autoclass

PyTerrier 0.10.0 has loaded Terrier 5.8 (built by craigm on 2023-11-01 18:05) and terrier-helper 0.0.8

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


### Step 2: Experimenting with Stemming vs. Lemmatization

In [6]:
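def lemmatize(t):
    # Instantiate the Java lemmatizer class via pyjnius and normalize a single term
    lemmatizer = autoclass("org.terrier.terms.StanfordLemmatizer")()
    return lemmatizer.stem(t)

def stem(t):
    # Same call pattern, but with Terrier's Porter stemmer
    stemmer = autoclass("org.terrier.terms.PorterStemmer")()
    return stemmer.stem(t)

def analyze(t):
    # Print the lemmatizer and stemmer output for a term side by side
    print(f'{t} => lemma: "{lemmatize(t)}"; porter: "{stem(t)}"')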


Next, we brainstorm candidate information needs for which the two approaches should behave differently.

In [7]:
analyze('corpus')
analyze('corpora')

corpus => lemma: "corpus"; porter: "corpu"
corpora => lemma: "corpus"; porter: "corpora"


Hypothesis: The lemmatizer should achieve higher recall for queries like `argument corpora`, `math corpora`, `math corpus`, etc., because it maps both `corpora` and `corpus` to `corpus`, whereas the Porter stemmer maps them to two different terms (`corpora` and `corpu`).
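
The recall hypothesis can be illustrated with a small toy sketch. It does not call Terrier: the two normalization tables simply copy the `analyze()` output shown above, while the two documents and the helper names (`normalize`, `docs_matching`) are made up for this illustration.

```python
# Toy illustration only: the tables copy the analyze() output shown above;
# the documents and the matching logic are assumptions made for this sketch.
PORTER = {"corpus": "corpu", "corpora": "corpora"}
LEMMA = {"corpus": "corpus", "corpora": "corpus"}

# Hypothetical mini-collection (not taken from the anthology).
DOCS = {
    "d1": "argument corpus",
    "d2": "argument corpora",
}

def normalize(text, table):
    """Map each whitespace token through the table; unknown tokens stay unchanged."""
    return {table.get(tok, tok) for tok in text.lower().split()}

def docs_matching(query_term, table):
    """Return the sorted ids of all documents containing the normalized query term."""
    term = table.get(query_term, query_term)
    return sorted(
        doc_id for doc_id, text in DOCS.items() if term in normalize(text, table)
    )

porter_hits = docs_matching("corpora", PORTER)
lemma_hits = docs_matching("corpora", LEMMA)
print("porter:", porter_hits)
print("lemma: ", lemma_hits)
```

In this toy setting, the query term `corpora` only reaches the document that contains the exact plural form under Porter stemming, while the lemmatized variant reaches both documents. Whether this carries over to the real collection still has to be checked against relevance judgments.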

In [8]:
analyze('booking')

booking => lemma: "booking"; porter: "book"


Hypothesis: The lemmatizer should achieve higher precision for queries like `booking experience`, `booking page`, `increase booking ratio`, etc., because the Porter stemmer reduces `booking` to `book` and therefore also retrieves documents about books, which the lemmatizer does not match.
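
A minimal sketch of the precision argument, again without Terrier. Only the mappings for `booking` are taken from the `analyze()` output above; the entries for `book`, the two documents, the relevance label, and the helpers `retrieve` and `precision` are assumptions made for this illustration.

```python
# Toy illustration only. The "booking" entries copy the analyze() output above;
# the "book" entries and all documents are assumptions made for this sketch.
PORTER = {"booking": "book", "book": "book"}
LEMMA = {"booking": "booking", "book": "book"}

# Hypothetical mini-collection with a hand-assigned relevance label for "booking".
DOCS = {
    "d1": "hotel booking experience",
    "d2": "book search experiments",
}
RELEVANT = {"d1"}

def retrieve(query_term, table):
    """Return the ids of all documents that contain the normalized query term."""
    term = table.get(query_term, query_term)
    hits = set()
    for doc_id, text in DOCS.items():
        tokens = {table.get(tok, tok) for tok in text.split()}
        if term in tokens:
            hits.add(doc_id)
    return hits

def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant (0.0 if nothing was retrieved)."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

porter_precision = precision(retrieve("booking", PORTER), RELEVANT)
lemma_precision = precision(retrieve("booking", LEMMA), RELEVANT)
print(f"porter precision: {porter_precision:.2f}")
print(f"lemma precision:  {lemma_precision:.2f}")
```

Here the stemmed variant also matches the document about book search, which halves its precision in this toy collection, while the lemmatized variant only matches the booking document.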


### Step 3: Load the Dataset, the Index, and define the Retrieval Pipeline


In [10]:
# The dataset: the union of the IR Anthology and the ACL Anthology
# This line creates an IRDSDataset object and registers it under the name provided as an argument.
pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240411-training')

In [15]:
# A (pre-built) PyTerrier index with Porter Stemmer loaded from TIRA
index_porter = tira.pt.index('ir-lab-sose-2024/tira-ir-starter/Index (tira-ir-starter-pyterrier)', pt_dataset)

# BM25 retrieval, cut off after the top 3 results (% 3), then attach each document's text
bm25_porter = pt.BatchRetrieve(index_porter, wmodel="BM25") % 3 >> pt.text.get_text(pt_dataset, "text")

In [12]:
# A (pre-built) PyTerrier index with Stanford Lemmatizer loaded from TIRA
index_lemmatizer = tira.pt.index('ir-lab-sose-2024/tira-ir-starter/Index (pyterrier-stanford-lemmatizer)', pt_dataset)

# BM25 retrieval, cut off after the top 3 results (% 3), then attach each document's text
bm25_lemmatizer = pt.BatchRetrieve(index_lemmatizer, wmodel="BM25") % 3 >> pt.text.get_text(pt_dataset, "text")

In [16]:
bm25_porter.search('booking experience')

Unnamed: 0,qid,docid,docno,rank,score,query,text
0,1,91210,2008.ecir_conference-2008.23,0,15.588513,booking experience,Book Search Experiments: Investigating IR Methods for the Indexing and Retrieval of Books
1,1,87217,2016.clef_conference-2016w.111,1,14.872849,booking experience,"SBS 2016 : Combining Query Expansion Result and Books Information Score for Book Recommendation\n\n\n In this paper, we present our contribution in Suggestion Track at the Social Book Search Lab. This track aims to develop test collections for evaluating ranking effectiveness of book retrieval and recommender systems. In our experiments, we combine the results of Sequential Dependence Model (SDM) and the books information that includes the price, the number Of P ages and the publication Date. We also expand topics' queries by the similar books information to improve the recommendation performance."
2,1,86121,2014.clef_conference-2014w.50,2,14.761358,booking experience,"A Methodology for Social Book Search\n\n\n A general overview of our methodology and results for the INEX 2014 Social Book Search Suggestion Task are presented in this paper. This is our first entry in the Social Book Search Track, which started in 2011. Our methodology and experiments are inspired by background research on the Social Book Search Track [5, 6, 7, 8, and 9]. We originally submitted six runs to the INEX 2014 competition and subsequently expanded our experiments as time allowed. Results, though preliminary, indicate some positive directions for future examination."


In [14]:
bm25_lemmatizer.search('booking experience')

Unnamed: 0,qid,docid,docno,rank,score,query,text
0,1,108896,2016.wwwconf_conference-2016c.14,0,25.016207,booking experience,"Travel the World: Analyzing and Predicting Booking Behavior using E-mail Travel Receipts\n\n\n ABSTRACTTourism industry has grown tremendously in the previous several decades. Despite its global impact, there still remain a number of open questions related to better understanding of tourists and their habits. In this work we analyze the largest data set of travel receipts considered thus far, and focus on exploring and modeling booking behavior of online customers. We extract useful, actionable insights into the booking behavior, and tackle the task of predicting the booking time. The presented results can be directly used to improve booking experience of customers and optimize targeting campaigns of travel operators."
1,1,96355,2010.cikm_conference-2010.293,1,24.641113,booking experience,Experiences with using SVM-based learning for multi-objective ranking\n\n\n ABSTRACTWe describe our experiences in applying learning-to-rank techniques to improving the quality of search results of an online hotel reservation system. The search result quality factors we use are average booking position and distribution of margin in topranked results. (We expect that total revenue will increase with these factors.) Our application of the SVMRank technique improves booking position by up to 25% and margin distribution by up to 14%.
2,1,7548,2022.acl-long.73,2,18.385665,booking experience,"Where to Go for the Holidays: Towards Mixed-Type Dialogs for Clarification of User Goals\n\n\n Most dialog systems posit that users have figured out clear and specific goals before starting an interaction. For example, users have determined the departure, the destination, and the travel time for booking a flight. However, in many scenarios, limited by experience and knowledge, users may know what they need, but still struggle to figure out clear and specific goals by determining all the necessary slots. * Equal contribution. † Mainly responsible for dataset collection during his internship at Baidu."
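
To quantify how much the two rankings for `booking experience` differ, one option is to compare the sets of retrieved documents. The sketch below hard-codes the `docno` values from the two result tables above, so it runs without the indexes; the `jaccard` helper is a name introduced here for illustration.

```python
# docno values copied from the two result tables above (top 3 per system).
porter_top3 = [
    "2008.ecir_conference-2008.23",
    "2016.clef_conference-2016w.111",
    "2014.clef_conference-2014w.50",
]
lemma_top3 = [
    "2016.wwwconf_conference-2016c.14",
    "2010.cikm_conference-2010.293",
    "2022.acl-long.73",
]

def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

overlap = jaccard(porter_top3, lemma_top3)
print(f"Jaccard overlap of the top-3 results: {overlap:.2f}")
```

For this query the two top-3 lists share no document: the Porter-based system returns book search papers, while the lemmatizer-based system returns papers that mention booking. This is consistent with the precision hypothesis from Step 2, although a single query is not enough to confirm it.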
