# Dataset

Download the New York City dataset from InsideAirbnb and build an inverted index with Whoosh.
The function `populate_index` handles both steps. It takes the name of the directory where the index will be stored,
usually `index`, located in the current working directory.
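
To make the term "inverted index" concrete, here is a minimal pure-Python sketch of the idea: a mapping from each term to the set of documents that contain it. This is only an illustration and is not how Whoosh or `populate_index` are implemented; the names `tokenize`, `build_inverted_index`, and `lookup`, as well as the sample documents, are made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map every term to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def lookup(index, query):
    """Return the ids of the documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample documents, keyed by listing id.
docs = {
    "1": "You are in the heart of Manhattan!",
    "2": "Quiet block in Brooklyn near the park",
    "3": "Lively Manhattan block",
}

toy_index = build_inverted_index(docs)
print(sorted(lookup(toy_index, "manhattan")))        # ['1', '3']
print(sorted(lookup(toy_index, "manhattan block")))  # ['3']
```

A real engine such as Whoosh adds analyzers, scoring, and on-disk storage on top of this basic structure.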

`.gitignore` tells git to ignore the top-level subdirectory `index`, where the inverted index for this project should be stored.

Once you have downloaded the dataset and generated the inverted index, you don't need to rebuild it: you can load it directly with Whoosh's `open_dir`.
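
One way to avoid rebuilding by accident is to guard the build step. The sketch below is an assumption, not part of `placerank`: the helper `ensure_index` is hypothetical, and it treats any non-empty directory as an existing index, which is only a heuristic. Whoosh's own `whoosh.index.exists_in` is a stricter check if you prefer to rely on the library.

```python
from pathlib import Path
from typing import Callable


def ensure_index(index_dir: str, build: Callable[[str], None]) -> bool:
    """Run build(index_dir) only when the directory is missing or empty.

    Returns True if the index was built, False if an existing one was kept.
    Heuristic: a non-empty directory is assumed to hold a valid index.
    """
    path = Path(index_dir)
    if path.is_dir() and any(path.iterdir()):
        return False
    build(index_dir)
    return True
```

In this notebook the call would be `ensure_index("index", populate_index)`, so that re-running the cell does not download and index the dataset again.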

In [1]:
from placerank.dataset import populate_index

Download the dataset source and build the index. Currently the keys `["id", "name", "description", "neighborhood_overview"]` are indexed. To index additional keys, edit the inverted index schema and add the keys to `placerank.dataset.DocumentLogicView`.

In [2]:
populate_index("index")

## Search

To search in the inverted index, we have to open it first.

In [3]:
from whoosh.index import open_dir

ix = open_dir("index")

Two additional objects are required to perform queries: a `Searcher` and a `QueryParser`. Their names are pretty self-explanatory.

In [4]:
from whoosh.qparser import QueryParser

parser = QueryParser("neighborhood_overview", ix.schema)

This query parser searches for terms in the `neighborhood_overview` field only. Next, parse a query and display its results.

In [10]:
UIN = "Manhattan"

query = parser.parse(UIN)
with ix.searcher() as searcher:
    results = searcher.search(query)
    print(*[(hit.get("id"), hit.get("name")) for hit in results], sep='\n')

('21421657', 'Rental unit in New York · ★4.89 · 1 bedroom · 2 beds · 1 bath')
('32403772', 'Rental unit in New York · ★4.67 · 1 bedroom · 1 bed · 1 bath')
('11619832', 'Rental unit in New York · ★4.49 · 1 bedroom · 2 beds · 1 shared bath')
('22805946', 'Rental unit in New York · ★3.77 · 2 bedrooms · 2 beds · 1 bath')
('43317941', 'Rental unit in Queens · 1 bedroom · 1 bed · 1 shared bath')
('746284414276369998', 'Rental unit in New York · 1 bedroom · 1 bed · 1 bath')
('3018031', 'Rental unit in New York · Studio · 1 bed · 1 bath')
('6558125', 'Rental unit in New York · ★4.69 · 1 bedroom · 1 bed · 1 bath')
('51714701', 'Rental unit in New York · ★4.86 · 1 bedroom · 1 bed · 1 bath')
('756685458967490030', 'Condo in New York · ★4.77 · 2 bedrooms · 2 beds · 2 baths')


In [11]:
from placerank.dataset import load_page

ID = 6558125

load_page(ID)

{'id': '6558125',
 'name': 'Rental unit in New York · ★4.69 · 1 bedroom · 1 bed · 1 bath',
 'room_type': 'Entire home/apt',
 'description': '',
 'neighborhood_overview': 'You are in the heart of Manhattan!'}