Tripadvisor Crawler

This project contains Python source code for crawling hotel reviews from tripadvisor.ie.

How to Run:

First-time setup: Python with virtualenv

    virtualenv --python=python2.7 .
    source bin/activate
    pip install -r requirements.txt

Run the hotel URL scraper

    scrapy crawl hspider

Run the review scraper

    scrapy crawl taspider -a urls=result/urls.txt

Warning

As of commit 59d7be388bb1592b2f9b2e5ddc787ec6d3eacf5c, urls.txt and the *.jl result files are written in append mode. Be careful when running the crawler more than once: new results are appended to the existing files instead of overwriting them. For now, manually checkpoint and save the scraped data between runs to avoid duplicated results.
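Because results are appended, one way to clean up after repeated runs is to drop duplicate lines from the output files. The following is a minimal sketch, not part of this project: the function names `dedupe_lines`, `dedupe_json_lines`, and `dedupe_file` are hypothetical, and it assumes each *.jl file holds one JSON object per line and that two records are duplicates only when all of their fields match.

```python
import json


def dedupe_lines(lines):
    """Drop blank and repeated lines, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for line in lines:
        text = line.strip()
        if not text or text in seen:
            continue
        seen.add(text)
        unique.append(text)
    return unique


def dedupe_json_lines(lines):
    """Like dedupe_lines, but compare parsed JSON so key order is ignored."""
    seen = set()
    unique = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        # Serialize with sorted keys to get a stable comparison key.
        key = json.dumps(json.loads(text), sort_keys=True)
        if key in seen:
            continue
        seen.add(key)
        unique.append(text)
    return unique


def dedupe_file(path, is_json_lines=False):
    """Rewrite the file at path in place without duplicate lines."""
    with open(path) as handle:
        lines = handle.readlines()
    if is_json_lines:
        unique = dedupe_json_lines(lines)
    else:
        unique = dedupe_lines(lines)
    with open(path, "w") as handle:
        handle.write("".join(text + "\n" for text in unique))
    return len(lines) - len(unique)
```

For example, `dedupe_file("result/urls.txt")` would clean the URL list, and `dedupe_file(path, is_json_lines=True)` would do the same for a *.jl file. Back up the files first, since the rewrite happens in place.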

Legacy Content

From the terminal, go to the scrapers/tripadvisor/tripadvisorCrawler directory.

Run the following command:

    scrapy crawl taspider -a urls=/path/to/file

Remarks

Clearlake Hotel

Test hotel 1: 22 reviews listed, 19 retrieved. The 3 missing reviews are Google Translate versions, which the crawler cannot retrieve.

Palmerstown Lodge

Test hotel 2: 31 reviews listed, 19 retrieved. A few of the missing reviews are Google Translate versions, and a few others cannot be retrieved for reasons that are not yet clear. For example, the downloaded first page contains only 6 reviews, while the same page viewed in a browser shows 10; the other 4 reviews may come from a Tripadvisor partner.
