# Scraping Hotel Ratings on Booking # 

In this homework we will practice web scraping on the following [site](https://www.booking.com/searchresults.html?aid=304142&label=gen173nr-1DCAEoggJCAlhYSDNiBW5vcmVmcgV1c19tYYgBAZgBMcIBA2FibsgBDdgBA-gBAfgBApICAXmoAgQ&sid=28d97f630803f9d48b4a1f535cbdd33f&class_interval=1&dest_id=20061717&dest_type=city&group_adults=2&group_children=0&label_click=undef&no_rooms=1&raw_dest_type=city&room1=A%2CA&sb_price_type=total&src=index&src_elem=sb&ss=Boston&ssb=empty&ssne_untouched=Cancún&rows=15). Let's get some basic information for each hotel in Boston.
On each hotel page, scrape the following information: 
1. Hotel Name
2. Class of Rating (Wonderful/Excellent/Very Good/Good)
3. Rating Score
4. Number of Reviews


**Save the data in "traveler_ratings.csv" in the following format: hotel_name, class_of_rating, rating, num_reviews**

**(10 pts)**

You can see an overview of the information as displayed:





![Information to be scraped](booking_sample.png)
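
The required file layout can be sketched with the standard `csv` module. This is only an illustration: the helper name `write_hotel_rows` and the two sample rows are made up, and the output goes to an in-memory buffer rather than to "traveler_ratings.csv".

```python
import csv
import io

# Column order taken from the assignment text above
HOTEL_FIELDS = ["hotel_name", "class_of_rating", "rating", "num_reviews"]

def write_hotel_rows(rows, out):
    """Write a header plus one CSV line per hotel to the file-like object out."""
    writer = csv.writer(out)
    writer.writerow(HOTEL_FIELDS)
    writer.writerows(rows)

# Made-up sample rows, only to show the layout (not scraped data)
sample = [
    ["Example Hotel", "Very Good", "8.4", "1,204"],
    ["Another Inn, Downtown", "Good", "7.6", "87"],
]
buffer = io.StringIO()
write_hotel_rows(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Using `csv.writer` instead of joining strings by hand means that values containing commas, such as a hotel name or a review count like "1,204", are quoted automatically.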

In [2]:
#Imports
from bs4 import BeautifulSoup
import sys
import time
import os
import logging
import argparse
import requests
import codecs
import json
import re
import pandas as pd
from pandas import DataFrame as df
import csv

In [3]:
# EXAMPLE: http://www.booking.com/Boston
base_url = "http://www.booking.com"
url = base_url +"/"+ 'boston'
response = requests.get(url)
html = response.text.encode('utf-8')
soup = BeautifulSoup(html, "lxml")
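# Look for the link whose text reads like "All ... properties in ..."
listing_path = None
for el in soup.findAll('a', href=True):
    if el.find(text=re.compile('All')) and el.find(text=re.compile('properties in')):
        listing_path = el['href']
# Fail loudly instead of building a broken URL when no such link exists
if listing_path is None:
    raise ValueError("No 'All properties' link found on " + url)
url = base_url + listing_path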

In [4]:
print("URL of All Hotel Listings in Boston")
city_url = url.strip('\n')
print(city_url)

URL of All Hotel Listings in Boston
http://www.booking.com/searchresults.html?city=20061717
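
The listing URL above is built by concatenating `base_url` and an `href`. A minimal sketch of an alternative, assuming some links on the page may already be absolute, is to resolve them with `urllib.parse.urljoin`. The helper name `absolute` is hypothetical; the URL is the one printed above.

```python
from urllib.parse import urljoin, urlparse, parse_qs

base_url = "http://www.booking.com"

def absolute(href):
    """Resolve href against base_url; absolute URLs pass through unchanged."""
    return urljoin(base_url, href)

relative = absolute("/searchresults.html?city=20061717")
already_absolute = absolute("http://www.booking.com/searchresults.html?city=20061717")

# Read the city id back out of the query string
city_id = parse_qs(urlparse(relative).query)["city"][0]
print(relative, city_id)
```

Plain concatenation would double the host for an `href` that is already absolute, while `urljoin` returns the same URL in both cases.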


In [6]:
def get_hotellist_page(city_url, count):
    """ Fetch the hotel list page at city_url and return its
        HTML as UTF-8 encoded bytes. Nothing is saved to disk,
        and the count argument is currently unused.
    """
    # Sleep 2 sec before starting a new http request
    time.sleep(2)
    # Request page
    response = requests.get(city_url)
    html = response.text.encode('utf-8')
    return html
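
The fixed `time.sleep(2)` above pauses before every request, even when the previous request was long ago. One possible refinement, sketched here under the assumption that a minimum gap between requests is the goal, is a small throttle object. The class name `Throttle` and its parameters are hypothetical and not part of the original homework.

```python
import time

class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Sleep only for the part of the interval that has not elapsed yet
                self._sleep(remaining)
                now = self._clock()
        self._last = now

# One shared throttle; throttle.wait() would go right before each requests.get(...)
throttle = Throttle(2.0)
```

The first call to `wait()` returns immediately, and later calls sleep only as long as needed to keep requests at least `min_interval` seconds apart.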

In [315]:
def parse_hotellist_page(url):
    """ Fetch one page of search results and append each hotel's
        name, rating class, rating score and number of reviews
        to traveler_ratings.csv. Return the URL of the next
        results page (a city can have more than one page of
        hotels), or a falsy value when there is none.
    """
    response = requests.get(url)
    html = response.text.encode('utf-8')
    soup = BeautifulSoup(html,"lxml")
    
    # Extract hotel name, rating class, rating score and number of reviews
    hotel_boxes = soup.findAll('div', {'class' :'sr_item_default'})

    for hotel_box in hotel_boxes:
        # Defaults, so a failed lookup never reuses the previous hotel's values
        name = ""
        class_rating = score = no_reviews = "N/A"
        try:
            name = hotel_box.find('span', {'class' :'sr-hotel__name'}).find(text=True).strip()
        except AttributeError:
            # find() returned None: this box has no hotel name
            pass
        else:
            try:
                # Rating score shown in the badge
                score = hotel_box.find('span', {'class' :'review-score-badge'}).find(text=True).strip()
                no_reviews = hotel_box.find('span', {'class' :'review-score-widget__subtext'}).find(text=True).strip()
                rating_span = hotel_box.find('span', {'class' :'review-score-widget__text'})
                class_rating = rating_span.find(text=True).strip()
                if class_rating == '':
                    # The label sits in a nested tag; pull it out of the raw HTML
                    class_rating = str(rating_span).split('>')[3].split('<')[0].strip()
            except (AttributeError, IndexError):
                # No rating information for this hotel: reset all three fields
                class_rating = score = no_reviews = "N/A"

        # Append one row per hotel: name, class of rating, score, number of reviews
        with open('traveler_ratings.csv', 'a', newline='', encoding='utf-8') as csvfile:
            csv.writer(csvfile).writerow([name, class_rating, score, no_reviews])



            
    # Get next URL page if exists, else stop
    div = soup.find("div", {"class" : "results-paging"})

    # No paging block, or the last-page marker: nothing more to scrape
    if div is None or div.find('span', {'class' : 'paging-end'}):
        return False
    # Otherwise the paging block should contain a "Next page" link
    for href in div.find_all('a', href=True):
        if href.find(text = True) == 'Next page':
            return href['href']
    return False
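
The function above returns the next page URL, or a falsy value on the last page. A driver loop built on that contract can be sketched as follows. The name `crawl_pages`, the `max_pages` cap, and the fake three-page site are assumptions for illustration; the guard against revisiting a URL is an addition that the original loop does not have.

```python
def crawl_pages(start_url, parse_page, max_pages=100):
    """Call parse_page on each URL in turn until it returns a falsy value.

    parse_page(url) must return the next URL, or None/False when done.
    Stops early on a repeated URL or after max_pages, and returns the
    list of URLs actually visited.
    """
    visited = []
    url = start_url
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        url = parse_page(url)
    return visited

# Stand-in for parse_hotellist_page: a hypothetical three-page site
fake_site = {"page1": "page2", "page2": "page3", "page3": None}
pages = crawl_pages("page1", fake_site.get)
print(pages)
```

Keeping the list of visited URLs prevents an endless loop if a "Next page" link ever points back to a page that was already scraped.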

In [316]:
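# Start the file fresh with a header row; parse_hotellist_page() appends to it
with open('traveler_ratings.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Hotel Name','Class of Rating', 'Rating Score', 'Number of Reviews'])
# Follow the "next page" links until the function returns a falsy value
while city_url:
    city_url = parse_hotellist_page(city_url)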
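
The rating score and the number of reviews are stored as scraped text. If numeric values are wanted later, a conversion step could look like the sketch below. The function names are hypothetical, and the input format (for example "1,204 reviews") is an assumption about how the site renders the count, not something confirmed by the output above.

```python
import re

def parse_review_count(text):
    """Return the first integer in text (thousands separators allowed), or None."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))

def parse_score(text):
    """Return text as a float, or None when it is not numeric (e.g. "N/A")."""
    try:
        return float(text)
    except ValueError:
        return None

# Assumed input shapes, for illustration only
count = parse_review_count("1,204 reviews")
score = parse_score("8.4")
missing = parse_score("N/A")
print(count, score, missing)
```

Returning `None` for unparseable text keeps the "N/A" placeholders from being mistaken for real numbers.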

Now let's scrape some reviews. For each review of each hotel in Boston, scrape the following attributes:
1. Reviewer name
2. Reviewer ethnicity
3. Number of reviews 
4. Number of helpful votes
5. Date
6. Rating
7. Negative Review
8. Positive Review

Note that you will also need the hotel's name!! Also, some reviews may not have all attributes. 

**Save the data in "review_ratings.csv" in the following format: hotel_name, reviewer_name, ethnicity, num_reviews, num_help_votes, date, rating, neg_review, pos_review**

**(25 pts)**

You can see an overview of the information as displayed:
![Information to be scraped](review_sample.png)
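
Since some reviews may not have all attributes, each row needs a placeholder for whatever is missing. A minimal sketch, assuming each review is first collected into a dictionary keyed by the column names above (the helper `review_row` and the sample review are made up):

```python
# Column order taken from the assignment text above
REVIEW_FIELDS = ["hotel_name", "reviewer_name", "ethnicity", "num_reviews",
                 "num_help_votes", "date", "rating", "neg_review", "pos_review"]

def review_row(review, missing="N/A"):
    """Return the values of review in REVIEW_FIELDS order, filling gaps with missing."""
    row = []
    for field in REVIEW_FIELDS:
        value = review.get(field)
        # Treat both absent keys and empty strings as missing
        row.append(missing if value in (None, "") else value)
    return row

# Made-up partial review: only four of the nine attributes are present
partial = {"hotel_name": "Example Hotel", "reviewer_name": "Sam",
           "rating": "9.2", "pos_review": "Great location"}
row = review_row(partial)
print(row)
```

Every row then has the same nine columns, so the CSV stays aligned even when a reviewer left no negative comment or has no helpful votes.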

In [None]:
def get_attr_hotel(attr_page, url_hotel):
    # Unfinished stub: parse_hotellist_review() below does the scraping without it
    pass

In [38]:
def parse_hotellist_review(url):
    """ Fetch one page of search results, visit every hotel on
        it, and append that hotel's reviews to ratings_review.csv.
        Return the URL of the next results page (a city can have
        more than one page of hotels), or a falsy value when
        there is none.
    """
    response = requests.get(url)
    html = response.text.encode('utf-8')
    time.sleep(1)
    soup = BeautifulSoup(html,"lxml")
    
    # Extract the hotel name and the link to each hotel's page
    hotel_boxes = soup.findAll('div', {'class' :'sr_item_default'})

    def first_text(tag):
        # First text node of tag, or "N/A" when the tag or its text is missing
        try:
            return tag.find(text=True).strip()
        except AttributeError:
            return "N/A"

    for hotel_box in hotel_boxes:
        hotel_rev = []
        state = True
        name = hotel_box.find('span', {'class' :'sr-hotel__name'}).find(text=True).strip()
        print('Hotel:', name)
        print('*'*100)
        link = base_url + hotel_box.find('a', {'class' :'hotel_name_link url'})['href'].strip()
        hotel_soup = BeautifulSoup(requests.get(link).text.encode('utf-8'), "lxml")
        try:
            link_all_rev = base_url + hotel_soup.find('a', {'class' :'show_all_reviews_btn'})['href'].strip()
        except (TypeError, KeyError):
            # No "show all reviews" button, so there is nothing to page through
            state = False
        while(state):
            # Pause, then fetch the current page of reviews
            time.sleep(1)
            rev_soup = BeautifulSoup(requests.get(link_all_rev).text.encode('utf-8'), "lxml")
            for rev_box in rev_soup.findAll('li', {'itemprop' :'review'}):
                # Look up each field separately so that one missing attribute
                # does not blank out the whole review
                nationality = first_text(rev_box.select_one('span.reviewer_country span[itemprop="name"]'))
                rev_name = first_text(rev_box.select_one('div.review_item_reviewer span[itemprop="name"]'))
                rev_count = first_text(rev_box.select_one('div.review_item_user_review_count'))
                rev_date = first_text(rev_box.select_one('p.review_item_date'))
                rev_rating = first_text(rev_box.select_one('span.review-score-badge'))
                rev_neg = first_text(rev_box.select_one('p.review_neg span[itemprop="reviewBody"]'))
                rev_pos = first_text(rev_box.select_one('p.review_pos span[itemprop="reviewBody"]'))

                hotel_rev.append([name, rev_name, nationality, rev_count, rev_date, rev_rating, rev_neg, rev_pos])
            # Append this page's rows to the file
            with open('ratings_review.csv', 'a', newline='', encoding='utf-8') as csvfile:
                writer = csv.writer(csvfile)
                writer.writerows(hotel_rev)
            # Clear the buffer so the next page does not write these rows again
            hotel_rev = []

            # Follow the "next page" link for this hotel's reviews, if there is one
            next_link = rev_soup.select_one('p.page_link.review_next_page a[href]')
            if next_link is None:
                break
            link_all_rev = base_url + next_link['href']
            print(link_all_rev)

    # Get next URL page if exists, else stop
    div = soup.find("div", {"class" : "results-paging"})

    # No paging block, or the last-page marker: nothing more to scrape
    if div is None or div.find('span', {'class' : 'paging-end'}):
        return False
    # Otherwise the paging block should contain a "Next page" link
    for href in div.find_all('a', href=True):
        if href.find(text = True) == 'Next page':
            return href['href']
    return False

In [39]:
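# Start the file fresh with a header row; parse_hotellist_review() appends to it
with open('ratings_review.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Hotel Name', 'Reviewer Name','Reviewer Ethnicity', 'Number of Reviews', 'Date', 'Rating', 'Negative Review', 'Positive Review' ])
# The earlier loop leaves city_url falsy, so re-run the cell that sets it first
while city_url:
    city_url = parse_hotellist_review(city_url)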

Hotel: Global Luxury Suites at West End
****************************************************************************************************
Hotel: 102 Chandler #9 By Lyon Apartments
****************************************************************************************************
Hotel: One-Bedroom on Washington Street Apt 207
****************************************************************************************************
Hotel: 102 Chandler #8 By Lyon Apartments
****************************************************************************************************
Hotel: 102 Chandler #4 By Lyon Apartments
****************************************************************************************************
Hotel: Loews Boston Hotel
****************************************************************************************************
http://www.booking.com/reviews/us/hotel/loews-boston-hotel.html?page=2&
http://www.booking.com/reviews/us/hotel/loews-boston-hotel.html?page=3&
http://www