# Retrieval

## Models

The performance of the IR system is evaluated across different combinations of query-matching models and IR models.

Several preprocessing techniques are also compared.

In [1]:
# The "index" directory will contain the different versions of the index as subdirectories.
# Create it if it does not exist yet.

import os
os.makedirs("index", exist_ok=True)  # unlike os.mkdir, does not fail if the directory already exists

In [2]:
from placerank.dataset import populate_index
from placerank.preprocessing import ANALYZER_NAIVE, ANALYZER_STEMMER, ANALYZER_LEMMATIZER

## Naive analyzer

The naive analyzer processes text through this pipeline:

- tokenization
- conversion to lowercase
- stop word removal

In [4]:
populate_index("index/naive", ANALYZER_NAIVE)
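
The three stages listed above can be sketched in plain Python. This is only an illustration: the definition of `ANALYZER_NAIVE` is not shown here, and the function name `naive_analyze` and the tiny stop list are assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stop list; a real analyzer uses a longer one.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def naive_analyze(text):
    """Tokenize, lowercase, then drop stop words."""
    tokens = re.findall(r"\w+", text)                   # tokenization
    tokens = [t.lower() for t in tokens]                # conversion to lowercase
    return [t for t in tokens if t not in STOP_WORDS]   # stop word removal

print(naive_analyze("Flat in Manhattan"))
```

With this stop list, the query "flat in manhattan" is reduced to the two terms `flat` and `manhattan`.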

## Stemmer analyzer

This preprocessing adds a stemming stage to the previous pipeline.

Whoosh's `StemFilter` uses the Porter stemming algorithm by default and reduces each word to its stem.

In [6]:
populate_index("index/stem", ANALYZER_STEMMER)
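
To give an intuition of what a stemming stage does, here is a toy suffix stripper. It is not the Porter algorithm, which applies a much larger, ordered set of rewrite rules; `toy_stem` and its suffix list are assumptions made purely for illustration.

```python
# Illustrative suffix list only; the Porter algorithm applies many more rules,
# each with conditions on the stem that remains.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_tokens(tokens):
    """Apply the toy stemmer to every token of an already-tokenized text."""
    return [toy_stem(t) for t in tokens]

print(stem_tokens(["flats", "renting", "rented", "manhattan"]))
```

Different surface forms ("renting", "rented") collapse to the same index term, so a query using one form can match documents that use another.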

## Lemmatizer analyzer

This preprocessing adds a lemmatization step to the naive pipeline; here, stop word removal is performed after lemmatization.

In [7]:
populate_index("index/lemma", ANALYZER_LEMMATIZER)
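
The order of the two stages can change the resulting terms. The sketch below shows one plausible effect, assuming a stop list that contains lemmas such as "be"; the lookup table, the stop list, and the function names are all hypothetical stand-ins, not the contents of `ANALYZER_LEMMATIZER`.

```python
# Hypothetical lookup table standing in for a real lemmatizer.
LEMMAS = {"is": "be", "was": "be", "were": "be", "rooms": "room", "flats": "flat"}
# Hypothetical stop list that contains the lemma "be" but not its inflections.
STOP_WORDS = {"a", "be", "in", "the"}

def lemmatize(tokens):
    return [LEMMAS.get(t, t) for t in tokens]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["the", "rooms", "were", "in", "manhattan"]

# Lemmatize first: "were" becomes "be", which the stop list then removes.
lemma_then_stop = remove_stop_words(lemmatize(tokens))
# Reverse order: "were" survives the stop filter and is lemmatized afterwards.
stop_then_lemma = lemmatize(remove_stop_words(tokens))

print(lemma_then_stop)
print(stop_then_lemma)
```

Under these assumptions, lemmatizing first lets the stop list catch every inflected form of a stop word.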

## Crisp sets

The crisp set model is a set-theoretic model based on the classic set operations.
Its membership function states, for each document, whether the document is relevant to the query or not; there is no degree of relevance.

A document is relevant if and only if it matches the query completely (i.e. it contains every term, for simple "AND" queries).
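
For a simple "AND" query, this membership test amounts to intersecting the posting sets of the query terms. The following is a minimal sketch on a toy inverted index; the data and the names `crisp_and_search` and `is_relevant` are assumptions for illustration and do not describe how `boolean_search` is implemented.

```python
# Toy inverted index: term -> set of document ids (hypothetical data).
INVERTED_INDEX = {
    "flat": {1, 2, 4},
    "manhattan": {2, 3, 4},
    "garden": {4, 5},
}

def crisp_and_search(index, terms):
    """Return the ids of the documents that contain every query term."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)

def is_relevant(index, doc_id, terms):
    """Crisp membership function: True or False, with nothing in between."""
    return doc_id in crisp_and_search(index, terms)

print(sorted(crisp_and_search(INVERTED_INDEX, ["flat", "manhattan"])))
print(is_relevant(INVERTED_INDEX, 1, ["flat", "manhattan"]))
```

Document 1 contains "flat" but not "manhattan", so it is not in the result set. The result is also unordered, since a crisp model does not rank documents.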

Open the inverted indexes built with the naive, stemmer, and lemmatizer pipelines.

In [1]:
from whoosh.index import open_dir

naive_ix = open_dir("index/naive")
stem_ix = open_dir("index/stem")
lemma_ix = open_dir("index/lemma")

In [2]:
from placerank.search import boolean_search
import ipywidgets as widgets

In [3]:
queryField = widgets.Text(
    value="flat in manhattan",
    description="Query:"
)
display(queryField)

Text(value='flat in manhattan', description='Query:')

In [7]:
res = boolean_search(naive_ix, queryField.value)
print(*res, sep="\n")

14528
30049
15843
15974
6727
10248
14665
8266
843
19560
28556
7410
12787
16151


Let's see the effect of stemming on the query.

In [8]:
res = boolean_search(stem_ix, queryField.value)
print(*res, sep="\n")

14528
30049
15843
15974
6727
10248
14665
8266
843
19560
28556
7410
12787
16151


Now let's see what changes when lemmatization is used instead of stemming.

In [9]:
res = boolean_search(lemma_ix, queryField.value)
print(*res, sep="\n")

14528
30049
15843
15974
6727
10248
14665
8266
843
19560
28556
7410
12787
16151


## Vector-space model

This section compares different retrieval techniques on a single index. The cell below opens `index/lemma`, the index built with the lemmatizer analyzer.

In [4]:
from placerank.search import vector_boolean_search, vector_tfidf_search, vector_bm25_search
from whoosh.index import open_dir
import ipywidgets as widgets

ix = open_dir("index/lemma")

In [9]:
queryField = widgets.Text(
    value="flat in manhattan",
    description="Query:"
)
display(queryField)

Text(value='flat in manhattan', description='Query:')

In [8]:
print(*vector_boolean_search(ix, queryField.value))
print(*vector_tfidf_search(ix, queryField.value))
print(*vector_bm25_search(ix, queryField.value))
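
The internals of the `placerank.search` functions are not shown here. As a generic illustration of how TF-IDF and BM25 scoring rank documents, the sketch below scores a toy corpus; the documents, the function names, and the parameter values `k1=1.2` and `b=0.75` are assumptions for the example. The TF-IDF variant is a plain sum of tf × idf without length normalization.

```python
import math
from collections import Counter

# Hypothetical, already-analyzed documents (doc id -> tokens).
DOCS = {
    1: ["flat", "manhattan", "flat"],
    2: ["house", "brooklyn"],
    3: ["flat", "brooklyn", "garden", "garden"],
}

N = len(DOCS)
AVG_LEN = sum(len(doc) for doc in DOCS.values()) / N
# Document frequency: number of documents containing each term.
DF = Counter(term for doc in DOCS.values() for term in set(doc))

def tfidf_score(query, doc):
    """Sum of tf * idf over the query terms, with idf = log(N / df)."""
    tf = Counter(doc)
    return sum(tf[t] * math.log(N / DF[t]) for t in query if DF[t])

def bm25_score(query, doc, k1=1.2, b=0.75):
    """Okapi BM25 with the common smoothed, non-negative idf variant."""
    tf = Counter(doc)
    score = 0.0
    for t in query:
        if not DF[t]:
            continue  # term absent from the corpus
        idf = math.log(1 + (N - DF[t] + 0.5) / (DF[t] + 0.5))
        norm = tf[t] + k1 * (1 - b + b * len(doc) / AVG_LEN)
        score += idf * tf[t] * (k1 + 1) / norm
    return score

def rank(score_fn, query):
    """Return ids of documents with a positive score, best first."""
    scores = {doc_id: score_fn(query, doc) for doc_id, doc in DOCS.items()}
    return sorted((d for d in scores if scores[d] > 0), key=lambda d: -scores[d])

query = ["flat", "manhattan"]
print(rank(tfidf_score, query))
print(rank(bm25_score, query))
```

Unlike the crisp set model, both scoring functions return partially matching documents (document 3 contains only "flat") and order them by score.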




