# Dataset

Download the InsideAirbnb dataset for New York City and build an inverted index using Whoosh (the example cell below points at the Cambridge, MA listings instead).
The function `populate_index` handles both steps. It takes the name of the directory where the index will be stored, usually `index`,
located in the current working directory.
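
For readers new to the term, an inverted index maps each token to the documents that contain it. The following is a minimal pure-Python sketch of that idea, not how Whoosh or `populate_index` work internally; the helper names `build_inverted_index` and `lookup` and the toy documents are made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def lookup(index, term):
    """Return the sorted ids of the documents that contain `term`."""
    return sorted(index.get(term.lower(), set()))

# Toy documents, invented for illustration only.
docs = {
    "a1": "cozy basement apartment",
    "b2": "sunny loft near the river",
    "c3": "cozy room near the park",
}
toy_index = build_inverted_index(docs)
print(lookup(toy_index, "cozy"))  # ['a1', 'c3']
```

A real index such as the one Whoosh builds also tokenizes and normalizes text and stores what it needs for scoring, which this sketch leaves out.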

`.gitignore` tells git to ignore the top level subdirectory `index`, where the inverted index for this project should be stored.

Once the dataset has been downloaded and the inverted index generated, you don't need to rebuild it: load it directly with Whoosh's `open_dir`.
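
One way to avoid rebuilding on every run is to check the index directory first. The sketch below uses a rough heuristic, assuming that a non-empty directory means the index was already built; `index_needs_build` is a hypothetical helper, not part of `placerank`. Whoosh's own `whoosh.index.exists_in` is a stricter check and may be preferable.

```python
from pathlib import Path

def index_needs_build(index_dir):
    """Heuristic: build when the directory is missing or empty."""
    path = Path(index_dir)
    return not path.is_dir() or not any(path.iterdir())

# Possible usage with the names from this notebook:
# if index_needs_build(INDEX_DIR):
#     populate_index(INDEX_DIR, LOCAL_COPY, URL)
# ix = open_dir(INDEX_DIR)
```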

In [1]:
from placerank.dataset import populate_index

Download the dataset source and build the index. Currently the fields `["id", "name", "description", "neighborhood_overview"]` are indexed. To index additional keys, edit the inverted index schema and add the same keys to `placerank.dataset.DocumentLogicView`.

In [2]:
URL = "http://data.insideairbnb.com/united-states/ma/cambridge/2023-12-26/data/listings.csv.gz"
LOCAL_COPY = "datasets/cambridge_listings.csv"
INDEX_DIR = "index"

populate_index(INDEX_DIR, LOCAL_COPY, URL)

## Search

To search in the inverted index, we have to open it first.

In [3]:
from whoosh.index import open_dir

ix = open_dir("index")

Two additional objects are required to perform queries: a `Searcher`, which runs queries against the index, and a `QueryParser`, which turns a query string into a query object.

In [4]:
from whoosh.qparser import QueryParser

parser = QueryParser("neighborhood_overview", ix.schema)

This query parser searches for terms in the `neighborhood_overview` field only. The next cell runs a query and displays the results.

In [5]:
UIN = "cozy"

query = parser.parse(UIN)
with ix.searcher() as searcher:
    results = searcher.search(query)
    print(*[(hit.get("id"), hit.get("name")) for hit in results], sep='\n')

('3678159', 'Rental unit in Cambridge · ★4.82 · 1 bedroom · 1 bed · 1 shared bath')
('53574737', 'Rental unit in Cambridge · ★4.92 · 2 bedrooms · 2 beds · 1.5 baths')
('918749466220965200', 'Rental unit in Cambridge · ★4.88 · 3 bedrooms · 4 beds · 2.5 baths')
('39896724', 'Rental unit in Cambridge · ★4.87 · 3 bedrooms · 3 beds · 1 bath')
('6363476', 'Rental unit in Cambridge · ★4.99 · 2 bedrooms · 3 beds · 1 bath')
('30428472', 'Home in Cambridge · ★4.92 · 1 bedroom · 2 beds · 1 bath')


In [14]:
from placerank.dataset import load_page

ID = "53574737"

load_page(LOCAL_COPY, ID)

{'id': '53574737',
 'name': 'Rental unit in Cambridge · ★4.92 · 2 bedrooms · 2 beds · 1.5 baths',
 'room_type': 'Entire home/apt',
 'description': '',
 'neighborhood_overview': 'Just around the corner from Harvard University, Mt Auburn Hospital, restaurants, bars, and coffee shops, with great access to public transportation; This cozy basement apartment is fully furnished with comfortable, modern accents. The Unit is located on Mt Auburn Street Between Longfellow & Channing. This is in a safe, friendly, clean, and vibrant neighborhood. Minutes walk from the Charles River, Harvard Sq., Redline, Harvard University, Mt Auburn Hospital, and Huron Village.'}
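
Judging from the call and its output, `load_page` returns selected fields of the listing whose `id` matches. A lookup of that kind could be sketched with the standard `csv` module as follows. This is an illustration under that assumption, not the actual implementation in `placerank.dataset`; `find_row` and the sample rows are hypothetical.

```python
import csv
import io

def find_row(csv_file, listing_id, fields):
    """Return the requested fields of the first row whose 'id' matches, else None."""
    for row in csv.DictReader(csv_file):
        if row["id"] == listing_id:
            return {key: row[key] for key in fields}
    return None

# In-memory sample standing in for the listings CSV (invented rows).
sample = io.StringIO(
    "id,name,room_type\n"
    "1,Toy loft,Entire home/apt\n"
    "2,Toy room,Private room\n"
)
row = find_row(sample, "2", ["id", "name"])
print(row)  # {'id': '2', 'name': 'Toy room'}
```

Ids are compared as strings because `csv.DictReader` yields every value as text.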

### Sentiment

In [63]:
from sentimentModule.sentiment import GoEmotionsClassifier
classifier = GoEmotionsClassifier()

UIN = "beautiful, quiet and peaceful apartment"

query = parser.parse(UIN)
with ix.searcher() as searcher:
    results = searcher.search(query)
    # Classify the parsed query string
    sentiment = classifier.classify_texts(str(query))
    # Print each hit's name, then the sentiment scores on the last line
    print(*[hit.get("name") for hit in results], sentiment, sep='\n')

Rental unit in New York · ★4.71 · 1 bedroom · 1 bed · 1 bath
Rental unit in New York · 1 bedroom · 3 beds · 1 bath
Home in Brooklyn · ★4.67 · 1 bedroom · 1 bed · 1 shared bath
Serviced apartment in New York · ★4.81 · 3 bedrooms · 4 beds · 2 baths
Rental unit in Staten Island  · ★4.88 · 1 bedroom · 1 bed · 1.5 shared baths
[[{'label': 'admiration', 'score': 0.991719126701355}, {'label': 'neutral', 'score': 0.5229455828666687}]]
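
The classifier output above is a nested list of label/score dictionaries. A small sketch for extracting the strongest label follows, assuming each inner list holds the predictions for one classified text (the output suggests this, but it is not confirmed here); `top_label` is a hypothetical helper.

```python
def top_label(predictions):
    """Return the (label, score) pair with the highest score."""
    best = max(predictions, key=lambda item: item["score"])
    return best["label"], best["score"]

# Output copied from the cell above.
sentiment = [[
    {"label": "admiration", "score": 0.991719126701355},
    {"label": "neutral", "score": 0.5229455828666687},
]]
labels = [top_label(per_text) for per_text in sentiment]
print(labels)  # [('admiration', 0.991719126701355)]
```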
