# IR Lab WiSe 2023: Stopword Lists

This tutorial shows how to configure and use custom stopword lists in PyTerrier.

**Attention:** The scenario below is cherry-picked to explain the concept of stopword lists with a minimal example.


## Preparation: Install dependencies

In [None]:
!pip3 install python-terrier



## Our Scenario

We want to build a search engine to support web developers working with CSS.

Our search engine indexes the following three documents:



In [None]:
import pandas as pd

documents = pd.DataFrame([
    {'docno': 'd1', 'text': 'In CSS, ::before creates a pseudo-element that is the first child of the selected element.'},
    {'docno': 'd2', 'text': 'In CSS, ::after creates a pseudo-element that is the last child of the selected element.'},
    {'docno': 'd3', 'text': 'The ::first-line CSS pseudo-element applies styles to the first line of a block-level element.'}
])

We create an index containing our three documents and use BM25 as the retrieval model:

In [None]:
import pyterrier as pt
if not pt.started():
    pt.init()

def create_index(df):
    indexer = pt.DFIndexer("./index", overwrite=True)
    index_ref = indexer.index(df["text"], df["docno"])
    return pt.IndexFactory.of(index_ref)

index = create_index(documents)

PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7



In [None]:
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

## The Problem

Our search engine has now been in operation for a while, and we have already received some positive feedback.
Still, some users complained by mail that our search engine retrieves no relevant results for queries like "before", "css before", "after", and "CSS after".

We wonder why this is the case, because our index contains relevant documents for all four queries: (1) document `d1` is relevant for "before" and "css before", and (2) document `d2` is relevant for "after" and "CSS after".

Let's look into the problem:

In [None]:
# searching for before returns no results
bm25.search("before")

Unnamed: 0,docid,docno,rank,score,qid,query


In [None]:
# searching for after returns no results
bm25.search("after")

Unnamed: 0,docid,docno,rank,score,qid,query


In [None]:
# searching for css after returns results, but the relevant document d2 is only on the last position
bm25.search("css after")

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,2,d3,0,-2.513562,css after
1,1,0,d1,1,-2.981605,css after
2,1,1,d2,2,-2.981605,css after


## The Solution

After some more debugging, we found out that [after](https://github.com/terrier-org/terrier-desktop/blob/master/share/stopword-list.txt#L206) and [before](https://github.com/terrier-org/terrier-desktop/blob/master/share/stopword-list.txt#L332) are both on the default stopword list and therefore removed from the documents and index.
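The effect can be sketched in plain Python. The tokenizer and the small stopword set below are simplified stand-ins chosen for illustration; they are not Terrier's actual term pipeline or its full default list, and stemming is left out.

```python
import re

# Assumption: a tiny excerpt standing in for Terrier's default stopword list.
DEFAULT_STOPWORDS = {"a", "after", "before", "in", "is", "of", "that", "the", "to"}

def analyze(text, stopwords):
    """Lowercase the text, split on non-alphanumeric characters, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stopwords]

doc_d1 = ("In CSS, ::before creates a pseudo-element that is the first "
          "child of the selected element.")

indexed_terms = analyze(doc_d1, DEFAULT_STOPWORDS)         # terms that reach the index
query_before = analyze("before", DEFAULT_STOPWORDS)         # → []
query_css_before = analyze("css before", DEFAULT_STOPWORDS) # → ['css']

print(indexed_terms)
print(query_before, query_css_before)
```

Under these assumptions, the query "before" becomes empty and "css before" shrinks to "css", a term that occurs in all three documents and therefore cannot single out `d1`.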

To address this problem systematically, we create a small Cranfield-style test collection to measure whether our changes to the stopword list improve retrieval:

In [None]:
# The information needs that we want to test
topics = pd.DataFrame([
    {'qid': '1', 'query': 'before'},
    {'qid': '2', 'query': 'css before'},
    {'qid': '3', 'query': 'after'},
    {'qid': '4', 'query': 'CSS after'},
])

qrels = pd.DataFrame([
    {'qid': '1', 'docno': 'd1', 'relevance': 1}, #d1 is the only relevant document for query 1
    {'qid': '2', 'docno': 'd1', 'relevance': 1}, #d1 is the only relevant document for query 2
    {'qid': '3', 'docno': 'd2', 'relevance': 1}, #d2 is the only relevant document for query 3
    {'qid': '4', 'docno': 'd2', 'relevance': 1}, #d2 is the only relevant document for query 4
])

In [None]:
pt.Experiment([bm25], topics, qrels, eval_metrics=['ndcg_cut_3', 'P_1'])

Unnamed: 0,name,ndcg_cut_3,P_1
0,BR(BM25),0.166667,0.0
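To make the two measures concrete, here is a minimal per-query sketch for binary relevance. The function names and the example rankings are hypothetical and only illustrate the formulas; `pt.Experiment` relies on its own evaluation backend, including its own tie-breaking, and averages the per-query values.

```python
import math

def precision_at_1(ranking, relevant):
    """1.0 if the top-ranked document is relevant, else 0.0."""
    return 1.0 if ranking and ranking[0] in relevant else 0.0

def ndcg_at_k(ranking, relevant, k=3):
    """nDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, docno in enumerate(ranking[:k], start=1)
        if docno in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical rankings for a query whose only relevant document is d1.
print(ndcg_at_k(["d3", "d2", "d1"], {"d1"}))       # relevant doc at rank 3 → 0.5
print(ndcg_at_k(["d1", "d3", "d2"], {"d1"}))       # relevant doc at rank 1 → 1.0
print(ndcg_at_k([], {"d1"}))                       # empty result list → 0.0
print(precision_at_1(["d3", "d2", "d1"], {"d1"}))  # → 0.0
```

A query that returns no results contributes 0 to both measures, which is why the empty result lists for "before" and "after" pull the averages down.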


Alright, now that we can measure the effectiveness, let's try to improve it.

After thinking about the problem, we conclude that our stopword list should still contain terms like "the", and that "css" should also be a stopword because all our documents are about CSS.

To implement this, we store the terms "the" and "css" in a file called `custom-stopwords.txt` and set PyTerrier's `stopwords.filename` property so that our new stopword list replaces the default one. Because "before" and "after" are not on the new list, they are no longer removed.

In [None]:
!echo -e "the\ncss" > custom-stopwords.txt
#check the content of the stopword list:
!cat custom-stopwords.txt

the
css
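Since the behaviour of `echo -e` depends on the shell, the same file can also be written from Python. This is a minimal sketch; `write_stopwords` is a hypothetical helper, and it assumes the one-term-per-line layout shown above.

```python
from pathlib import Path

def write_stopwords(path, terms):
    """Write one lowercase term per line and return the terms read back."""
    path = Path(path)
    lines = [term.strip().lower() for term in terms]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path.read_text(encoding="utf-8").split()

stopwords = write_stopwords("custom-stopwords.txt", ["the", "css"])
print(stopwords)  # → ['the', 'css']
```

Reading the file back right after writing it serves the same purpose as the `cat` call: it confirms what will be used as the stopword list.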


In [None]:
# we use our new stopword list
pt.set_property("stopwords.filename", "./custom-stopwords.txt")

# we create a new index
index = create_index(documents)
bm25 = pt.BatchRetrieve(index, wmodel="BM25")



In [None]:
pt.Experiment([bm25], topics, qrels, eval_metrics=['ndcg_cut_3', 'P_1'])

Unnamed: 0,name,ndcg_cut_3,P_1
0,BR(BM25),1.0,1.0
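Why the custom list helps can be illustrated with a toy inverted index. The sketch below assumes a simplified tokenizer that splits on non-alphanumeric characters and uses Boolean OR matching instead of BM25, so it only shows which documents can match at all; it is not Terrier's implementation.

```python
import re
from collections import defaultdict

CUSTOM_STOPWORDS = {"the", "css"}

DOCS = {
    "d1": "In CSS, ::before creates a pseudo-element that is the first child of the selected element.",
    "d2": "In CSS, ::after creates a pseudo-element that is the last child of the selected element.",
    "d3": "The ::first-line CSS pseudo-element applies styles to the first line of a block-level element.",
}

def analyze(text):
    """Lowercase, split on non-alphanumeric characters, drop custom stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in CUSTOM_STOPWORDS]

# Toy inverted index: term -> set of docnos containing that term.
inverted_index = defaultdict(set)
for docno, text in DOCS.items():
    for term in analyze(text):
        inverted_index[term].add(docno)

def matching_docs(query):
    """Return the docnos that contain at least one non-stopword query term."""
    docs = set()
    for term in analyze(query):
        docs |= inverted_index.get(term, set())
    return sorted(docs)

matches = {q: matching_docs(q) for q in ["before", "css before", "after", "CSS after"]}
print(matches)
```

With "css" removed and "before" and "after" kept, each of the four queries matches exactly one document in this sketch, namely the relevant one.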


## Summary

Adjusting our stopword list improved effectiveness substantially: nDCG@3 rose from 0.167 to 1.0 and Precision@1 from 0.0 to 1.0.

To summarize everything, please answer the following three questions:

### Question 1: Is stopword removal a precision-oriented or a recall-oriented technique?

Example Solution: Stopword removal is precision-oriented because fewer documents are retrieved.

### Question 2: There are many different stopword lists out there. Please find three stopword lists and skim over them. Do you spot obvious differences or surprising terms?

Example Solution: Every well-founded observation is a correct answer. One possible observation is that very frequent terms such as "the" and "be" appear on most lists, while the lists otherwise often differ considerably. A good answer might also point to research showing that the choice of stopword list can have a substantial impact on retrieval effectiveness.

### Question 3: Do you know famous phrases that are challenging to retrieve when we apply stopword removal?

Example Solution: `to be or not to be`