# Master Data Science for Business - Data Science Consulting - Session 2 

### Notebook 3: Web Scraping with Scrapy: Getting reviews from TripAdvisor

<u>Context</u>: This notebook was originally created by Capgemini. We adapted it to scrape reviews of the Center Parcs "Le Lac d'Ailette" resort on TripAdvisor.

<u>The main issue we faced when running this notebook</u>: changing the URL of the listing to scrape triggered several errors. For instance, some tags were different, and we could not scrape the page number, which we use to stop the script. Also, we only managed to scrape 663 reviews. We are not sure why, but this number matches the total number of reviews in English, even though there are also reviews in French.

## 1. Importing packages

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
import sys
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
import json
import logging
import pandas as pd

## 2. Item class and helper functions

In [2]:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

class HotelreviewsItem(scrapy.Item):
    # Fields collected for each review
    rating = scrapy.Field()
    review = scrapy.Field()
    title = scrapy.Field()
    trip_date = scrapy.Field()
    trip_type = scrapy.Field()
    published_date = scrapy.Field()
    image_url = scrapy.Field()
    hotel_type = scrapy.Field()
    hotel_name = scrapy.Field()
    hotel_adress = scrapy.Field()
    price_range = scrapy.Field()
    reviewer_id = scrapy.Field()
    review_id = scrapy.Field()
    review_language = scrapy.Field()
    pid = scrapy.Field()
    locid = scrapy.Field()

In [3]:
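def user_info_splitter(raw_user_info):
    """
    Split the raw user-info string on whitespace and collect the pieces that
    can be converted into (key, value) pairs.

    Note: get_convertible_elements_as_dic is not defined in this notebook, so
    this helper only works once that function is provided. It is currently
    unused (see the commented-out lines at the end of parse_review_page).

    :param raw_user_info: raw string extracted from the reviewer block
    :return: dict mapping each converted key to its value
    """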

    user_info = {}

    splited_info = raw_user_info.split()
    for element in splited_info:
        converted_element = get_convertible_elements_as_dic(element)
        if converted_element:
            user_info[converted_element[0]] = converted_element[1]

    return user_info

## 2. Creating the JSON pipeline

In [4]:
# JSON Lines pipeline; you can rename "tripadvisor2.jl" to the name of your choice
class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('tripadvisor2.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
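
The pipeline above writes one JSON object per line (the JSON Lines format). As a minimal sketch of that write pattern and of reading the file back, using only the standard library: the temporary file path and the two sample items are illustrative assumptions, not output of the spider.

```python
import json
import os
import tempfile

# Two illustrative items; the field names match HotelreviewsItem,
# the values are sample data.
items = [
    {"review_id": "619991006", "rating": 4, "review_language": "fr"},
    {"review_id": "646844882", "rating": 5, "review_language": "fr"},
]

# Write one JSON object per line, as process_item does.
path = os.path.join(tempfile.mkdtemp(), "sample.jl")
with open(path, "w", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item) + "\n")

# Read the file back: each non-empty line is an independent JSON document.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), loaded[0]["review_id"])
```

With pandas, a file in this format can be loaded with `pd.read_json(path, lines=True)`.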

## 3. Spider

Now that you know how to get data from one page, we want to automate the spider so that it crawls through all pages of reviews, ending with a full spider able to scrape every review of the selected park. You will modify the parse function here, since this is where you tell the spider which links to extract and follow. <br>
<b>To Do</b>: Complete the following code to scrape all the reviews of one park.
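
Following links usually means turning a relative `href` into an absolute URL. The sketch below uses `urllib.parse.urljoin` from the standard library to show the difference between resolving a link against the current page and prefixing a hard-coded domain. The relative `next_href` is a hypothetical example, not a link copied from TripAdvisor.

```python
from urllib.parse import urljoin

# URL of the page being parsed (the start URL used in this notebook).
page_url = (
    "https://www.tripadvisor.fr/Hotel_Review-g1572451-d775381-Reviews-"
    "Center_Parcs_Le_Lac_d_Ailette-Chamouille_Aisne_Hauts_de_France.html"
)

# Hypothetical relative href, as a "next page" link might expose it.
next_href = "/Hotel_Review-g1572451-d775381-Reviews-or5-Center_Parcs_Le_Lac_d_Ailette.html"

# Resolving against the current page keeps the crawl on the same domain.
resolved = urljoin(page_url, next_href)

# Prefixing a fixed domain can silently switch to another version of the site.
hardcoded = "https://www.tripadvisor.com" + next_href

print(resolved)
print(hardcoded)
```

Inside a Scrapy callback, `response.urljoin(href)` performs the same resolution against the URL of the response. Mixing a hard-coded domain with a start URL on a different domain is one thing worth checking when the languages of the scraped reviews look unexpected, although we have not verified that this explains our results.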

In [5]:
import re

def cleanhtml(raw_html):
    """Remove HTML tags from a string and return the remaining text."""
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext
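
As a small usage sketch, this is how a tag-stripping helper of this kind turns pagination markup into page numbers that can be compared. The two HTML snippets are assumed shapes with made-up values, and the helper is redefined here so that the example runs on its own.

```python
import re

def cleanhtml(raw_html):
    """Remove HTML tags from a string and return the remaining text."""
    return re.sub(r'<.*?>', '', raw_html)

# Illustrative markup for the current and last page markers (assumed shape).
current_html = '<span class="pageNum current disabled">3</span>'
last_html = '<a class="pageNum last" data-offset="660">133</a>'

# int() tolerates surrounding whitespace but raises ValueError on other text.
current_page_number = int(cleanhtml(current_html))
last_page_number = int(cleanhtml(last_html))

print(current_page_number, last_page_number)
print(current_page_number < last_page_number)
```

The regular expression only removes tags. It does not decode HTML entities, and `int()` fails if the remaining text is not a plain number, which is one way a change in the page markup can break the stop condition.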

In [6]:
class MySpider(scrapy.Spider):  # plain Spider: no crawling rules are defined and parse() is overridden
    name = 'BasicSpider'
    domain_url = "https://www.tripadvisor.com"
    # allowed_domains = ["https://www.tripadvisor.com"]

    start_urls = ["https://www.tripadvisor.fr/Hotel_Review-g1572451-d775381-Reviews-Center_Parcs_Le_Lac_d_Ailette-Chamouille_Aisne_Hauts_de_France.html"]
    
    #Custom settings to modify settings usually found in the settings.py file 
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1}, # Used for pipeline 1
        'FEED_FORMAT':'json',                                 # Used for pipeline 2
        'FEED_URI': 'tripadvisor2.json'                       # Used for pipeline 2
    }

    def parse(self, response):
        # Page numbers shown in the pagination bar; either can be missing
        # (for example on a listing with a single page of reviews)
        current_page = response.xpath('//*[contains(@class,"current") and contains(@class, "pageNum")]').extract_first()
        last_page = response.xpath('//*[contains(@class,"last") and contains(@class, "pageNum")]').extract_first()

        # Relative link to the next page of reviews (None when there is no such link)
        next_href = response.xpath("//a[contains(@class,'nav') and contains(@class,'next') and contains(@class,'primary')]/@href").extract_first()

        # Scrape all pages: follow the "next" link until the last page is reached
        if next_href and current_page and last_page:
            if int(cleanhtml(current_page)) < int(cleanhtml(last_page)):
                # urljoin resolves the relative href against the current page's URL
                yield scrapy.Request(response.urljoin(next_href), callback=self.parse)

        # Request each review page once
        review_urls = []
        for partial_review_url in response.xpath("//div[contains(@class,'quote')]/a/@href").extract():
            review_url = response.urljoin(partial_review_url)
            if review_url not in review_urls:
                review_urls.append(review_url)
                yield scrapy.Request(review_url, callback=self.parse_review_page)

    def parse_review_page(self, response):

        item = HotelreviewsItem()

        # The review metadata is exposed as data-* attributes on the same <span>
        user_links_span = (
            "//div[contains(@class,'prw_reviews_resp_sur_h_featured_review')]"
            "/div/div/div/div/div[contains(@class,'prw_reviews_user_links_hs')]/span"
        )
        # extract_first() returns None when nothing matches
        item["reviewer_id"] = response.xpath(user_links_span + "/@data-memberid").extract_first()
        item["review_language"] = response.xpath(user_links_span + "/@data-language").extract_first()
        item["review_id"] = response.xpath(user_links_span + "/@data-reviewid").extract_first()
        item["pid"] = response.xpath(user_links_span + "/@data-pid").extract_first()
        item["locid"] = response.xpath(user_links_span + "/@data-locid").extract_first()

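        review_id = item["review_id"]
        # The review is embedded as JSON-LD; parse it with json.loads rather than eval,
        # which would execute scraped text and fails on JSON literals such as true/false/null
        ld_json_blocks = response.xpath('//script[@type="application/ld+json"]/text()').extract()
        review = json.loads(ld_json_blocks[0])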

        item["review"] = review["reviewBody"].replace("\\n", "")
        item["title"] = review["name"]
        item["rating"] = review["reviewRating"]["ratingValue"]
        item["image_url"] = review["image"]
        item["hotel_type"] = review["itemReviewed"]["@type"]
        item["hotel_name"] = review["itemReviewed"]["name"]
        item["price_range"] = review["itemReviewed"]["priceRange"]
        item["hotel_adress"] = review["itemReviewed"]["address"]
        try:
            item["published_date"] = review["datePublished"]
        except KeyError:
            # Fall back to the date shown next to the rating on the page
            item["published_date"] = response.xpath(
                f"//div[contains(@id,'review_{review_id}')]/div/div/span[@class='ratingDate']/@title").extract_first()

        item["trip_type"] = next(iter(response.xpath("//div[contains(@class,"
                                                     "'prw_reviews_resp_sur_h_featured_review')]/div/div/div/div/div"
                                                     "/div/div/div[contains(@class,'noRatings')]/text()").extract()),
                                 None)

        item["trip_date"] = response.xpath("//div[contains(@class,"
                                           "'prw_reviews_resp_sur_h_featured_review')]/div/div/div/div["
                                           "contains(@class,'prw_reviews_stay_date_hsx')]/text()").extract_first()

        if item["trip_date"] is None:
            # Fallback layout: look the stay date up by the id of the current review
            # (the original code hard-coded the id of a single review here)
            item["trip_date"] = response.xpath(
                f"//div[contains(@id,'review_{review_id}')]/div/div/div[@data-prwidget-name='reviews_stay_date_hsx']/text()").extract_first()

        # user_info = response.xpath("//div[contains(@class,'prw_reviews_resp_sur_h_featured_review')]/div/div/div/div/div[contains(@class,'prw_reviews_user_links_hs')]").extract()[0]
        # item["unstructured"] = user_info_splitter(user_info)

        yield item
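
The review fields are read from a JSON-LD block embedded in the page. As a hedged sketch of how such a block can be parsed with the standard `json` module: the keys mirror those used in `parse_review_page`, the values are illustrative, and `isFamilyFriendly` is a hypothetical extra key added only to show a JSON literal that Python's `eval` cannot read.

```python
import json

# Illustrative JSON-LD payload (assumed shape, sample values).
ld_json = """
{
  "@type": "Review",
  "name": "Endroit sympathique",
  "reviewBody": "Génial pour les enfants",
  "reviewRating": {"ratingValue": "4"},
  "itemReviewed": {"@type": "LodgingBusiness", "name": "Center Parcs Le Lac d'Ailette"},
  "isFamilyFriendly": true
}
"""

review = json.loads(ld_json)

# .get returns None instead of raising KeyError when a key is absent.
published_date = review.get("datePublished")
price_range = review.get("itemReviewed", {}).get("priceRange")

print(review["name"], review["reviewRating"]["ratingValue"], published_date, price_range)

# eval cannot read the same payload: `true` is not a Python name.
try:
    eval(ld_json.strip())
    eval_failed = False
except NameError:
    eval_failed = True
print(eval_failed)
```

Using `.get` for optional keys such as `datePublished` or `priceRange` is a design choice: the item keeps a `None` value for a missing field instead of the whole review being dropped on a `KeyError`.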


## 4. Crawling

In [7]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start()

2019-01-24 15:34:15 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2019-01-24 15:34:15 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.6.5 |Anaconda, Inc.| (default, Apr 26 2018, 08:42:37) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.2.2, Platform Darwin-18.2.0-x86_64-i386-64bit
2019-01-24 15:34:15 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'json', 'FEED_URI': 'tripadvisor2.json', 'LOG_LEVEL': 30, 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}


## 5. Importing and reading data scraped

If the crawl succeeded, you should see here a DataFrame with one row per scraped review (663 in our run, as `dfjson.shape` shows below). Congratulations!

In [8]:
dfjson = pd.read_json('tripadvisor2.json')
#Previewing DF
dfjson.head()

Unnamed: 0,hotel_adress,hotel_name,hotel_type,image_url,locid,pid,price_range,published_date,rating,review,review_id,review_language,reviewer_id,title,trip_date,trip_type
0,"{'@type': 'PostalAddress', 'streetAddress': '1...",Center Parcs Le Lac d'Ailette,LodgingBusiness,https://media-cdn.tripadvisor.com/media/photo-...,775381,38673,115€ - 267€ (Selon les tarifs moyens d'une ch...,27 septembre 2018,4,Apr\u00e8s une d\u00e9sastreuse aventure au Bo...,619991006,fr,A70008728AC6B6105263C31212876BAB,tr\u00e8s bon week end,août 2018,
1,"{'@type': 'PostalAddress', 'streetAddress': '1...",Center Parcs Le Lac d'Ailette,LodgingBusiness,https://media-cdn.tripadvisor.com/media/photo-...,775381,38673,115€ - 267€ (Selon les tarifs moyens d'une ch...,18 janvier 2019,5,"Ambiance d\u00e9tendue , une vraie d\u00e9conn...",646844882,fr,E1D229008CB104D9A0B50E1C28400729,"S\u00e9jour agr\u00e9able comme toujours ,une ...",mars 2018,A voyagé en famille
2,"{'@type': 'PostalAddress', 'streetAddress': '1...",Center Parcs Le Lac d'Ailette,LodgingBusiness,https://media-cdn.tripadvisor.com/media/photo-...,775381,38673,115€ - 267€ (Selon les tarifs moyens d'une ch...,11 novembre 2018,3,Premi\u00e8re fois que nous allions \u00e0 cen...,632573813,fr,56719A188FFAF310DB972EF10480F7DD,"3,5 serait plus juste",juillet 2018,A voyagé en couple
3,"{'@type': 'PostalAddress', 'streetAddress': '1...",Center Parcs Le Lac d'Ailette,LodgingBusiness,https://media-cdn.tripadvisor.com/media/photo-...,775381,38673,115€ - 267€ (Selon les tarifs moyens d'une ch...,3 octobre 2018,4,G\u00e9nial pour les enfants petits et grands ...,621674154,fr,05035FA5944AFAFFD00A7A4A4A6331D7,Endroit sympathique,avril 2018,
4,"{'@type': 'PostalAddress', 'streetAddress': '1...",Center Parcs Le Lac d'Ailette,LodgingBusiness,https://media-cdn.tripadvisor.com/media/photo-...,775381,38673,115€ - 267€ (Selon les tarifs moyens d'une ch...,17 janvier 2019,2,Nous avons fait une r\u00e9servation avec notr...,646623017,fr,0304614F0BFFA918916A51FD21624A4E,R\u00e9servation f\u00e9vrier 2019,janvier 2019,A voyagé en famille


In [9]:
dfjson.shape

(663, 16)
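
To look into the language question raised in the introduction, one simple check is to count the scraped reviews per `review_language` and to verify that each `review_id` appears only once. The sketch below does this with the standard library on a few sample records. The first three ids and languages come from the preview above, while the last record is a hypothetical English review added for illustration.

```python
from collections import Counter

# Sample records shaped like the scraped items.
records = [
    {"review_id": 619991006, "review_language": "fr"},
    {"review_id": 646844882, "review_language": "fr"},
    {"review_id": 632573813, "review_language": "fr"},
    {"review_id": 100000001, "review_language": "en"},  # hypothetical record
]

# Number of reviews per language.
language_counts = Counter(r["review_language"] for r in records)

# Number of distinct reviews; a smaller value than len(records) means duplicates.
unique_review_ids = {r["review_id"] for r in records}

print(language_counts.most_common())
print(len(unique_review_ids), len(records))
```

On the DataFrame itself, `dfjson["review_language"].value_counts()` and `dfjson["review_id"].nunique()` give the same information.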