CX4242 project: Airbnb vs Hotels
Our CX4242 project focuses on quantifying differences between Airbnb listings and hotels in the city of New York. The entire project consists of several different components:
- Data collection and scraping: We collected data from several sources, notably Airbnb, Amadeus (a travel IT company with an API for booking/pricing information), TripAdvisor, and OpenStreetMap.
- NLP analysis on reviews: We used the Stanford CoreNLP library to segment reviews and perform sentiment analysis.
- Search engine: We compiled all Airbnb and hotel data into an ElasticSearch instance hosted on AWS so that both datasets can be searched at once.
- Visualization UI: We summarized all of the data and analyses through an interactive webpage.
Our finalized datasets are stored in an AWS ElasticSearch instance, and our site is hosted with AWS ElasticBeanstalk.
Our project uses Python 3.5.
To install all Python dependencies used in this project, run
pip install -r requirements.txt
Execution - Instructions for Recreating our Project
Running the TripAdvisor scraper
First, use the appropriate repository by doing
base_spider.py - This spider gets the necessary URLs (through TripAdvisor's autocomplete) for each city that we search. This only needs to be run once, and it outputs to
listings_spider.py - Uses the URLs from the previous part to crawl for listings. Run with
scrapy crawl listings -o listings.json
hotels_spider.py - Scrapes hotel amenities for each listing. Run with
scrapy crawl hotels -o amenities.json
amenities.json contains price, amenities, and some other basic information for the TripAdvisor search results.
reviews_spider.py - Scrapes review text for each listing obtained from the listings spider. Run with
scrapy crawl reviews -a filename=<filename>, where the file is a CSV with TripAdvisor URLs for each hotel.
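As a rough illustration of the input the reviews spider expects, the sketch below reads hotel URLs out of a CSV using only the standard library. The column name `url`, the helper `read_hotel_urls`, and the sample rows are assumptions made for this example; the actual file layout used by reviews_spider.py may differ.

```python
import csv
import io


def read_hotel_urls(csv_file, url_column="url"):
    """Return the non-empty values of `url_column` from an open CSV file."""
    reader = csv.DictReader(csv_file)
    urls = []
    for row in reader:
        # Rows with a missing or blank URL are skipped rather than crawled.
        url = (row.get(url_column) or "").strip()
        if url:
            urls.append(url)
    return urls


# Illustrative input only; real column names and URLs will differ.
sample = io.StringIO(
    "name,url\n"
    "Hotel A,https://example.com/hotel-a\n"
    "Hotel B,\n"
    "Hotel C,https://example.com/hotel-c\n"
)
hotel_urls = read_hotel_urls(sample)
```

Skipping blank rows up front keeps the crawler from issuing requests for hotels whose URL was never captured.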
Running scripts to collect data from Amadeus
Scripts for collecting data from the Amadeus API are in the amadeus-api folder. In order to access the Amadeus API, sign up for an API key. Then, set an environment variable for this key.
On Unix-based systems:
export AMADEUS_KEY='your api key here'
On Windows:
setx AMADEUS_KEY "your api key here"
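A script can then pick the key up from the environment. The snippet below is a minimal sketch of that lookup, not the code used in the amadeus-api folder; the function name `get_amadeus_key` is hypothetical, and only the variable name AMADEUS_KEY comes from the instructions above.

```python
import os


def get_amadeus_key(environ=os.environ):
    """Return the Amadeus API key, or raise if AMADEUS_KEY is unset or empty."""
    key = environ.get("AMADEUS_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "AMADEUS_KEY is not set; set it before running the Amadeus scripts."
        )
    return key
```

Failing early with a clear message is usually easier to debug than an authentication error from the API. Note that on Windows, a value stored with setx is only visible in terminals opened afterwards.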
We wanted to merge data from both TripAdvisor and Amadeus.
search.py - This script searches for hotels in Amadeus based on the coordinates of hotels we've already scraped from TripAdvisor.
recordPrices.py - This script searches each hotel for prices across a range of dates.
cd amadeus-api
python search.py
python recordPrices.py
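Merging two hotel datasets by coordinates usually comes down to pairing each hotel with the closest record from the other source. The sketch below shows one way to do that with the haversine distance. It is an assumption about how such matching could work, not a description of search.py: the helper names, the `lat`/`lon` keys, and the 0.1 km threshold are all hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * r * math.asin(math.sqrt(a))


def nearest_match(hotel, candidates, max_km=0.1):
    """Return the candidate closest to `hotel`, or None if none is within max_km."""
    best, best_dist = None, max_km
    for cand in candidates:
        d = haversine_km(hotel["lat"], hotel["lon"], cand["lat"], cand["lon"])
        if d <= best_dist:
            best, best_dist = cand, d
    return best
```

A distance threshold keeps unrelated hotels from being merged when one source has no record for a property; in a dense area like New York the threshold would likely need tuning, possibly alongside a name comparison.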
Downloading Basic Airbnb Listing and Reviews Data
The Airbnb listings are from Inside Airbnb.
Scraping Airbnb prices over time
data/scrape_airbnb_prices.py scrapes Airbnb prices for given listings on given dates.
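Collecting prices over time amounts to querying every listing for every date of interest. The sketch below enumerates those (listing, date) pairs; it is illustrative only, and the function names, the weekly step, and the example IDs and dates are assumptions rather than what data/scrape_airbnb_prices.py actually does.

```python
from datetime import date, timedelta
from itertools import product


def date_range(start, end, step_days=1):
    """Yield dates from start to end inclusive, step_days apart."""
    if step_days < 1:
        raise ValueError("step_days must be at least 1")
    current = start
    while current <= end:
        yield current
        current += timedelta(days=step_days)


def scrape_jobs(listing_ids, start, end, step_days=7):
    """Return every (listing_id, check_in_date) pair to query."""
    return list(product(listing_ids, date_range(start, end, step_days)))


# Hypothetical listing IDs and dates, for illustration only.
jobs = scrape_jobs([101, 202], date(2017, 4, 1), date(2017, 4, 15))
```

Building the full job list first makes it straightforward to resume an interrupted scrape by skipping pairs that already have a recorded price.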
To see some example data that we scraped/collected/merged, see the data folder.
Add data to ElasticSearch
In AWS, create a new ElasticSearch instance, with indices for
hotels (all hotel data),
airbnbs (Airbnb listing data), and
airbnb_prices (Airbnb temporal data). See the data folder for more details about uploading.
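For reference, documents are commonly loaded into an index through Elasticsearch's bulk endpoint, which takes a newline-delimited body of alternating action and document lines. The sketch below builds such a body with the standard library. The helper name `to_bulk_ndjson` and the `id` field are assumptions for illustration, and depending on the Elasticsearch version the action line may also need a `_type` entry; the data folder remains the authority on how this project uploads its data.

```python
import json


def to_bulk_ndjson(index_name, docs, id_field="id"):
    """Serialize docs into a newline-delimited body for the _bulk endpoint."""
    lines = []
    for doc in docs:
        # Each document is preceded by an action line naming its index and ID.
        action = {"index": {"_index": index_name, "_id": str(doc[id_field])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    if not lines:
        return ""
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


# Hypothetical document, for illustration only.
body = to_bulk_ndjson("hotels", [{"id": 1, "name": "Example Hotel"}])
```

The resulting string would be sent as the request body to the instance's `_bulk` endpoint, with one call per index (hotels, airbnbs, airbnb_prices).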
Sentiment Analysis on Reviews
See the reviews analysis folder for more details.
Running the Web App
To run the web application, first set environment variables for ElasticSearch access keys:
export ES_KEY='your key ID here'       # or setx ES_KEY "key ID" on Windows
export ES_SECRET='your secret here'    # or setx ES_SECRET "secret" on Windows
Then, start the web application.
cd flask-app
python application.py
Open localhost:8000 in your browser to see the site.