TRIPADVISOR SCRAPY CRAWLER

Usage

Main csvs

The first step is to obtain the initial CSVs. Use the geoRestaurant spider to crawl the complete list of restaurants from a TripAdvisor geographic page about restaurants (or a similar page).

To generate the CSV, go to the root folder of the crawler and run from a terminal:

scrapy crawl geoRestaurant -a url="http://www.tripadvisor.com/yourpageurl.html" -O "your/output/file.csv"

Reviews

Once you have your CSV, you can scrape all reviews of all the restaurants collected in it. The file getAllReviews.py can be run from the command line. You need to pass the name of the CSV file you want to iterate over as an argument.

python getAllReviews.py -a csv="your/csvfile.csv"

For every link in the CSV, the spider named reviewsRestaurant collects all reviews for that restaurant and writes them to a JSON file inside a folder named after the initial CSV. This will probably take a long time, depending on the size of the CSV and the number of reviews for each restaurant.

N.B. It can be very useful to modify getAllReviews.py according to your needs.
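
As a possible starting point for such modifications, the loop described above could be sketched roughly as follows. This is not the actual content of getAllReviews.py: the column name `url`, the `-a url=...` spider argument, and the `restaurant_<n>.json` file naming are assumptions made for illustration only.

```python
import csv
from pathlib import Path


def build_review_commands(csv_path, url_column="url"):
    """Return one `scrapy crawl reviewsRestaurant` command per CSV row.

    Assumptions (not confirmed by the project): the CSV has a column named
    `url_column`, and the spider accepts the restaurant link as `-a url=...`.
    """
    csv_path = Path(csv_path)
    # Output folder named after the initial CSV, e.g. data/rome.csv -> data/rome/
    out_dir = csv_path.with_suffix("")
    commands = []
    with csv_path.open(newline="", encoding="utf-8") as handle:
        for index, row in enumerate(csv.DictReader(handle)):
            url = (row.get(url_column) or "").strip()
            if not url:
                continue  # skip rows without a link
            # Hypothetical file naming: one JSON file per CSV row index.
            out_file = out_dir / f"restaurant_{index}.json"
            commands.append([
                "scrapy", "crawl", "reviewsRestaurant",
                "-a", f"url={url}",
                "-O", str(out_file),
            ])
    return commands
```

Each returned command could then be executed with `subprocess.run(command, check=True)` after creating the output folder, which makes it easy to filter rows or resume a partial run.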

Enrichment

To scrape more information from the main page of each restaurant, another spider is needed. Use the Restaurants spider to iterate once over a specific CSV and obtain a complementary version of it, containing information not present in the first list. The CSV is thereby completed with fields such as specialDiets, covidMeasure (boolean), geographic coordinates, and further geographic information useful for completing and correcting the data already collected.

scrapy crawl Restaurants -a csv="your/originalfile.csv" -O "your/newfile.csv"
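
If you want to keep the original file and the enriched file side by side and combine them yourself, a join on a shared column is one option. The sketch below is an illustration under assumptions, not part of the crawler: it presumes both CSVs share a key column (here hypothetically named `url`) and lets non-empty enriched values replace the original ones.

```python
import csv


def merge_csvs(original_path, enriched_path, merged_path, key="url"):
    """Add the columns of the enriched CSV to the rows of the original CSV.

    Assumption: both files share a `key` column identifying the restaurant.
    """
    with open(enriched_path, newline="", encoding="utf-8") as handle:
        enriched = {row[key]: row for row in csv.DictReader(handle)}

    with open(original_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        fieldnames = list(reader.fieldnames or [])

    # Append columns that exist only in the enriched file, preserving order.
    for extra in enriched.values():
        for name in extra:
            if name not in fieldnames:
                fieldnames.append(name)

    with open(merged_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        for row in rows:
            extra = enriched.get(row[key], {})
            # Only non-empty enriched values replace the original ones.
            updates = {k: v for k, v in extra.items() if v}
            writer.writerow({**row, **updates})
    return fieldnames
```

Rows without a match in the enriched file are kept unchanged, with the new columns left empty.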


micheleandreucci/Social-Network-Analysis-project
