# Retrieval

## Models

The performance of the IR system is evaluated across different combinations of query matching models and IR models.

Several text preprocessing techniques are compared as well.

In [None]:
# The "index" directory will contain different versions of the index as subdirectories.
# Create the directory only if it does not exist already.

import os
os.makedirs("index", exist_ok=True)

In [1]:
from placerank.dataset import populate_index
from placerank.preprocessing import ANALYZER_NAIVE, ANALYZER_STEMMER, ANALYZER_LEMMATIZER

## Naive analyzer

The naive analyzer processes text through the following pipeline:

- tokenization

- conversion to lowercase

- stop-word removal
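
The three stages above can be sketched in plain Python. This is only an illustration of the pipeline order, not the code behind `ANALYZER_NAIVE`: the function name `naive_analyze`, the regular expression, and the stop list are assumptions made for this example.

```python
import re

# Hypothetical stop list for illustration; the analyzer's real list is not shown in this notebook.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def naive_analyze(text):
    """Tokenize, lowercase, and drop stop words, in that order."""
    tokens = re.findall(r"\w+", text)                   # tokenization
    tokens = [t.lower() for t in tokens]                # conversion to lowercase
    return [t for t in tokens if t not in STOP_WORDS]   # stop-word removal


print(naive_analyze("Flat in Manhattan"))
```

With this stop list, the query "Flat in Manhattan" is reduced to the two terms `flat` and `manhattan`.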

In [None]:
populate_index("index/naive", ANALYZER_NAIVE)

## Stemmer analyzer

This analyzer adds a stemming stage to the previous pipeline.

Whoosh's `StemFilter` uses the Porter stemming algorithm by default, which reduces each word to its stem.
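
To give an intuition of what stemming does to index terms, here is a deliberately simplified suffix stripper. It is not Porter's algorithm, which applies several ordered rule sets; the suffix list and the `toy_stem` name are made up for this sketch.

```python
# Toy suffix stripper, NOT Porter's algorithm: the suffix list is a made-up illustration.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([toy_stem(w) for w in ["rooms", "renting", "rented", "bed"]])
```

Different surface forms such as "renting" and "rented" collapse to the same term, so a query containing one form can match documents containing the other.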

In [None]:
populate_index("index/stem", ANALYZER_STEMMER)

## Lemmatizer analyzer

This analyzer adds a lemmatization step to the naive pipeline. Unlike in the previous pipelines, stop words are removed after lemmatization rather than before it.
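
The order of the two steps can change the resulting terms. The sketch below compares both orders on a toy example; the lemma table, the stop list, and the helper names are hypothetical and do not reflect what `ANALYZER_LEMMATIZER` uses internally.

```python
# Hypothetical lemma table and stop list, for illustration only.
LEMMAS = {"was": "be", "is": "be", "rooms": "room"}
STOP_WORDS = {"be", "the"}


def lemmatize(tokens):
    """Replace each token with its lemma when one is known."""
    return [LEMMAS.get(t, t) for t in tokens]


def remove_stop_words(tokens):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


tokens = ["the", "rooms", "was", "clean"]

# Lemmatize first: "was" becomes "be", which the stop list then removes.
lemma_then_stop = remove_stop_words(lemmatize(tokens))

# Remove stop words first: "was" is not in the stop list, so its lemma survives.
stop_then_lemma = lemmatize(remove_stop_words(tokens))

print(lemma_then_stop)
print(stop_then_lemma)
```

In this toy setting, filtering after lemmatization lets a stop list written in terms of lemmas catch inflected forms too.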

In [2]:
populate_index("index/lemma", ANALYZER_LEMMATIZER)

## Crisp sets

The crisp set model is a set-theoretic model based on the classic set operations.
Its membership function is binary: for each document, it states whether the document is relevant to the query or not, with no intermediate degrees of relevance.

A document is relevant if and only if it fully matches the query (i.e. it contains every term, for simple "AND" queries).
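
For simple "AND" queries, this membership function amounts to intersecting the posting sets of the query terms. A minimal sketch over a toy inverted index follows; the index contents and the names `crisp_and` and `is_relevant` are made up for this example and are independent of Whoosh.

```python
# Toy inverted index: term -> set of document ids (made-up data).
INDEX = {
    "flat": {1, 2, 4},
    "manhattan": {2, 3, 4},
    "garden": {4},
}


def crisp_and(index, terms):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(t, set()) for t in terms]
    if not postings:
        return set()
    return set.intersection(*postings)


def is_relevant(index, terms, doc_id):
    """Crisp membership function: 1 if the document is relevant, 0 otherwise."""
    return int(doc_id in crisp_and(index, terms))


print(crisp_and(INDEX, ["flat", "manhattan"]))
print(is_relevant(INDEX, ["flat", "manhattan"], 1))
```

Document 1 contains "flat" but not "manhattan", so its membership value is 0: a partial match counts for nothing in this model.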

Open the inverted index built with the naive pipeline.

In [1]:
from whoosh.index import open_dir
from whoosh.qparser import MultifieldParser

ix = open_dir("index/naive")
parser = MultifieldParser(fieldnames=ix.schema.logicview.keys(), schema=ix.schema)

In [5]:
from whoosh.searching import Searcher
from whoosh import scoring

In [7]:
QUERY = "flat in manhattan"

q = parser.parse(QUERY)

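with ix.searcher() as searcher:
    # scored=False with sortedby=None skips ranking: matching documents are
    # returned in arbitrary order, which is all the crisp set model requires.
    res = searcher.search(q, scored=False, sortedby=None)
    print(*res, sep="\n")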

<Hit {'id': '668934945181265366', 'name': 'Serviced apartment in New York · 1 bedroom · 1 bed · 1 bath', 'room_type': 'Entire home/apt'}>
<Hit {'id': '669491267862682923', 'name': 'Serviced apartment in New York · ★4.0 · Studio · 1 bed · 1 bath', 'room_type': 'Entire home/apt'}>
<Hit {'id': '963789422666124992', 'name': 'Serviced apartment in New York · Studio · 1 bed · 1 bath', 'room_type': 'Entire home/apt'}>
<Hit {'id': '43319281', 'name': 'Rental unit in New York · ★5.0 · 1 bedroom · 1 bed · 1 bath', 'room_type': 'Entire home/apt'}>
<Hit {'id': '50702550', 'name': 'Rental unit in New York · ★4.55 · 1 bedroom · 2 beds · 1 shared bath', 'room_type': 'Private room'}>
<Hit {'id': '6969011', 'name': 'Townhouse in Brooklyn · ★4.77 · 1 bedroom · 1 bed · 1.5 baths', 'room_type': 'Private room'}>
<Hit {'id': '51019131', 'name': 'Rental unit in New York · ★4.59 · 1 bedroom · 2 beds · 1 shared bath', 'room_type': 'Private room'}>
<Hit {'id': '1029317224873552477', 'name': 'Rental unit in New 

## Vector-space model

### Binary Independence Model