                                                                              Helga Sigríður Magnúsdóttir s202027 
                                                                                 Hlynur Árni Sigurjónsson s192302
                                                                             Katrín Erla Bergsveinsdóttir s202026
                                                                                Kristín Björk Lilliendahl s192296
 
 ![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

<img src="https://susfans.eu/sites/default/files/clients/DTU.png"  align="right" width="300"/>

# Web Scraping Restaurant Reviews on Tripadvisor

### What information do we want and how do we retrieve it?
 
The idea is to scrape the reviews of all restaurants in Copenhagen together with their general information, such as cuisine type, rank and location. In addition, we wanted information about the reviewers, such as how many contributions and followers each reviewer has. The scraping tool is built on the Python packages **BeautifulSoup** and **Selenium**. When designing a scraping tool, one has to think carefully about all the specific tags and the differences between pages that can arise during a run like this. The amount of data in our case is large, and the total run time was about 45 hours.

## Tools and Packages

* Beautiful Soup, for HTML extraction.
* The `csv` module, for writing to csv files.
* Selenium Web Driver, for browser loading and actions.

## Contents

* [1. Scraper Info](#scraper)
* [2. Restaurant Info](#restinfo)
* [3. Reviews](#reviews)
* [4. Reviewer Info](#revinfo)
* [5. Improvements](#improv)

---

### Let's start by importing

In [1]:
import requests 
from bs4 import BeautifulSoup
import csv 
from selenium import webdriver
import time
import sys
import os
import argparse
import string
import pickle
import pandas as pd  # used later to load the merged reviews

 ![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

<a id='scraper'></a>
# 1. Scraper Info

A scraper tool was created to gather the reviews and the restaurant information into csv files. The skeleton of the tool was taken from the GitHub repository [LaskasP](https://github.com/LaskasP/TripAdvisor-Python-Scraper-Restaurants-2021). It later turned out that the code had many errors and crashed after a few calls, so the scraper was fixed and improved, and the information it gathers was enriched. In addition, a scraper for information about reviewers was created. Since the data set was large and the tool was expected to encounter some errors along the way, the restaurant URLs are stored in a file. In our case, we chose to look at restaurants in the Copenhagen area, with over 1900 restaurants available. Each restaurant has on average 700 reviews, so the scraping time quickly adds up.

The scraper is based on the **beautifulsoup** package and **selenium**. Selenium is used to open pages and click on elements, in order to reach the next pages or reveal additional information.

#### Selenium actions:
* Click the "next" button, since Tripadvisor only displays a limited number of restaurants (20) or reviews (10) per page
* Click the "boxes" that pop out with additional information about reviewers
* Click the "more" button when a review exceeds a certain length

#### Procedure:
* Find a Tripadvisor page with the selected area and select only restaurants
* Run the **scrapeRestaurantsUrlsAll** function, which retrieves all the URLs in the selected area
* Run through all the URLs and scrape the reviews with the **get_reviews** function
* If all reviews were retrieved successfully, remove the restaurant URLs file
* Separately run the **scrapeRestaurantInfo** function to get the information about the restaurants

As can be seen in the code snippets below, there are many `try`/`except` clauses in the code. This is due to many small deviations across Tripadvisor pages. Data can be missing for some restaurants, so the scraper tries to retrieve each field and leaves it empty if that fails.
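
The pattern of trying an extraction and falling back to an empty value can be factored into a small helper. The following is only a minimal sketch and not part of the scraper itself: `safe_extract` is a hypothetical name, and the `missing_tag` variable stands in for the `None` that `soup.find(...)` returns when a tag is absent.

```python
def safe_extract(extract, default=None):
    """Run an extraction callable; return `default` if the element is missing."""
    try:
        return extract()
    except (AttributeError, TypeError, IndexError, KeyError):
        # A missing tag is represented by None, so chained attribute
        # access or indexing raises one of these errors.
        return default


# Stand-in for a page where the rating tag was not found.
missing_tag = None
avg_rating = safe_extract(lambda: missing_tag.text.strip())
nr_reviews = safe_extract(lambda: missing_tag.text.split()[0], default=0)
print(avg_rating, nr_reviews)
```

Catching only the errors that a missing element can raise keeps unrelated problems, such as a network failure, visible instead of silently swallowing them.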

### Let's create the csv files we write our scraped data to

In [None]:
pathToReviews = "TripReviews.csv"
pathToRestaurantInfo = "RestaurantInfo.csv"
pathtoReviewers = "reviewerInfo.csv"
pathAllRestaurants = "AllRestaurants.txt" 

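with open(pathToRestaurantInfo, mode='a', encoding="utf-8") as trip:
    data_writer = csv.writer(trip, delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
    # One column per value written by scrapeRestaurantInfo (the last two names follow its all_ranks and all_ratings variables)
    data_writer.writerow(['storeName', 'storeAddress', 'avgRating', 'nrReviews', 'priceCategory','CousineType', 'Rank', 'allRanks', 'allRatings'])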

with open(pathToReviews, mode='a', encoding="utf-8") as trip_data:
    data_writer = csv.writer(trip_data, delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
    data_writer.writerow(['storeName', 'reviewerUsername', 'ratingDate', 'reviewHeader','reviewText', 'rating'])
    
with open(pathtoReviewers, mode='a', encoding="utf-8") as reviewer_data:
    data_writer = csv.writer(reviewer_data, delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
    data_writer.writerow(['username', 'location', 'joined', 'nrContributions','nrReviews', 'nrUpvotes', 'nrFollowers','followers', 'nrFollowing','following'])
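
Because the files above are opened in append mode, re-running the cell adds a second header row. One way to guard against that is sketched below, under the assumption that a non-empty file already has its header; `ensure_csv_header` is a hypothetical helper, and the demo writes to a temporary directory rather than the real output files.

```python
import csv
import os
import tempfile


def ensure_csv_header(path, header):
    """Write the header row only when the file is missing or empty."""
    if os.path.exists(path) and os.path.getsize(path) > 0:
        return False  # assume the header is already there
    with open(path, mode="w", encoding="utf-8", newline="") as handle:
        csv.writer(handle).writerow(header)
    return True


# Demo in a temporary directory so no real data file is touched.
demo_path = os.path.join(tempfile.mkdtemp(), "TripReviews.csv")
columns = ["storeName", "reviewerUsername", "ratingDate",
           "reviewHeader", "reviewText", "rating"]
first_call = ensure_csv_header(demo_path, columns)
second_call = ensure_csv_header(demo_path, columns)
print(first_call, second_call)
```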

<a id='restinfo'></a>
# 2. Restaurant Info

When listing all restaurants in Copenhagen, Tripadvisor shows at most 20 restaurants per page. The scraper follows the link behind the "next page" button until it reaches the end. We only keep the restaurant URLs that are not marked "sponsored", since sponsored entries also appear later in the list and would give us duplicates and waste time. The next page button looks like this:

<img src="webpage_figures/Next_button.png" width="800" height="400">


We collect all of the URLs first and then iterate through them, first to get the reviews and then to get each restaurant's information.

In [None]:
# Get urls for all "next" pages in a selected area
def scrapeRestaurantsUrlsAll(url, limit=100):
    store_name = []
    urls = []
    limit_set = 1
    nextPage = True
    while nextPage and limit_set <= limit:
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        results = soup.find('div', class_='_1kXteagE')
        stores = results.find_all('div', class_='wQjYiB7z') 
        for store in stores:
            if store.find('a', class_ = '_15_ydu6b').text[0].isdigit(): # skip sponsored listings, since they also appear later in the ranked list
                
                print(store.find('a', class_ = '_15_ydu6b').text)
                unModifiedUrl = str(store.find('a', href=True)['href'])
                urls.append('https://www.tripadvisor.com'+unModifiedUrl)
        limit_set += 1
        #Go to next page if exists
        try:
            print('tried next in finding all')
            unModifiedUrl = str(soup.find('a', class_ = 'nav next rndBtn ui_button primary taLnk',href=True)['href'])
            # print(unModifiedUrl, 'later unmod')
            url = 'https://www.tripadvisor.com' + unModifiedUrl
            # print('new url is ', url)
        except:
            print('no next in finding all')
            nextPage = False

    with open(pathAllRestaurants, 'wb') as f:
        pickle.dump(urls, f)

    print(f'Total restaurant count: {len(urls)}')
    return urls
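
The filtering step in the function above can be looked at in isolation. This is a minimal sketch with made-up listings: it assumes, as the function does, that ranked listings have titles starting with a digit while sponsored ones do not. `organic_urls` and the sample paths are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.tripadvisor.com"


def organic_urls(listings):
    """Keep listings whose title starts with a rank number and build full URLs."""
    urls = []
    for title, href in listings:
        # title[:1] is safe for empty titles, unlike title[0]
        if title[:1].isdigit():
            urls.append(urljoin(BASE_URL, href))
    return urls


# Made-up (title, relative link) pairs for illustration only.
sample = [
    ("Sponsored Bistro", "/Restaurant_Review-example-a.html"),
    ("1. Example Kitchen", "/Restaurant_Review-example-b.html"),
]
print(organic_urls(sample))
```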

In [None]:
startingUrl = "https://www.tripadvisor.com/Restaurants-g189541-Copenhagen_Zealand.html" # All Copenhagen restaurants
urls = scrapeRestaurantsUrlsAll(startingUrl, limit=2300)

# Reload the pickled URL list written by scrapeRestaurantsUrlsAll
with open("AllRestaurants.txt", "rb") as f:
    urls = pickle.load(f)
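
Storing the URL list with `pickle` means it has to be written with `pickle.dump` and read back with `pickle.load`, both on a file opened in binary mode. A small round-trip sketch follows; the helper names and the example URLs are hypothetical, and a temporary directory is used so the real `AllRestaurants.txt` is not overwritten.

```python
import os
import pickle
import tempfile


def save_urls(path, urls):
    """Serialize the URL list to disk."""
    with open(path, "wb") as handle:
        pickle.dump(urls, handle)


def load_urls(path):
    """Read the URL list back from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


demo_path = os.path.join(tempfile.mkdtemp(), "AllRestaurants.txt")
example_urls = ["https://www.tripadvisor.com/example-a.html",
                "https://www.tripadvisor.com/example-b.html"]
save_urls(demo_path, example_urls)
restored = load_urls(demo_path)
print(restored == example_urls)
```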

## Scrape the restaurants information 

The restaurant information is scraped next. The most relevant data is retrieved and stored in a separate csv file. Here the **beautifulsoup** package was sufficient to retrieve the data needed. Again, there are many `try`/`except` clauses in the code, since information is missing for many of the restaurants, which would otherwise cause the code to fail. The red boxes in the figure below mark the information we were interested in.

<img src="webpage_figures/restaurant_info.png" width="800" height="400">

In [None]:
def scrapeRestaurantInfo(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    storeName = soup.find('h1', class_='_3a1XQ88S').text
    try:
        avgRating = soup.find('span', class_='r2Cf69qf').text.strip()
        nrReviews = soup.find('a', class_='_10Iv7dOs').text.strip().split()[0]
    except:
        avgRating = None
        nrReviews = 0
    storeAddress = soup.find('div', class_= '_2vbD36Hr _36TL14Jn').find('span', class_='_2saB_OSe').text.strip()
#     urlAddress = str(soup.find('div', class_ = '_2vbD36Hr _36TL14Jn').find('span').find('a', href=True)['href'])
    
    try:
        cousineType = [word.text for  word in soup.find('span', class_='_13OzAOXO _34GKdBMV').find_all('a')]
        cousine = True
    except:
        cousineType = []
        cousine = False
    nrPos = soup.find('a', class_='_15QfMZ2L').find('b').find('span').text.strip()
    
    # Other rankings 
    all_ranks = []
    try:
        all_ranks = [word.text for word in soup.find('div', class_ = '_3acGlZjD').find_all('div', class_ = '_3-W4EexF')]
    except:
        all_ranks = []
        
    # Other ratings
    all_ratings = []
    try:
        rating = soup.find_all('div', class_='jT_QMHn2')
        rating_type = [x.find('span', class_ = '_2vS3p6SS').text for x in rating]
        true_rating = [x.find('span', class_ = '_377onWB-') for x in rating]
        true_rating = [int(str(x.findChildren('span')).split('_')[3][:2])/10 for x in true_rating]
        all_ratings = list(zip(rating_type,true_rating))
    except:
        all_ratings = []
        
    with open(pathToRestaurantInfo, mode='a', encoding="utf-8") as trip:
        data_writer = csv.writer(trip, delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
        if len(cousineType) > 1:
            data_writer.writerow([storeName, storeAddress, avgRating, nrReviews, cousineType[0], cousineType[1:], nrPos, all_ranks, all_ratings])
        elif len(cousineType) == 1:
            data_writer.writerow([storeName, storeAddress, avgRating, nrReviews, cousineType[0], [], nrPos, all_ranks, all_ratings])
        else:
            data_writer.writerow([storeName, storeAddress, avgRating, nrReviews, [], [], nrPos, all_ranks, all_ratings])
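
The ratings in the function above are recovered from the markup of the rating element by splitting on underscores. A regular expression makes the same idea a little easier to read. This sketch assumes the rating is encoded in a class name such as `bubble_45` (meaning 4.5), which is an assumption about the page markup, and `parse_bubble_rating` is a hypothetical helper.

```python
import re


def parse_bubble_rating(markup):
    """Turn a class name like 'bubble_45' inside the markup into 4.5."""
    match = re.search(r"bubble_(\d{2})", markup)
    if match is None:
        return None  # no rating encoded in this element
    return int(match.group(1)) / 10


# Hypothetical markup for illustration.
print(parse_bubble_rating('<span class="ui_bubble_rating bubble_45"></span>'))
print(parse_bubble_rating('<span class="no_rating"></span>'))
```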

In [None]:
with open("AllRestaurants.txt", "rb") as f:   # Unpickling
    urls = pickle.load(f)

finished = []
bad_url = []
for url in urls:
    try:
        scrapeRestaurantInfo(url)
        finished.append(url)
    except:
        bad_url.append(url)

<a id='reviews'></a>
# 3. Reviews

A function for getting the reviews was created. It uses selenium to go through all the "next" pages, since Tripadvisor only displays 10 reviews per page. Selenium is also used to click the "more" button when a review is too long to display in its container. The figure below shows that additional text becomes available once the more button is pushed.

<img src="webpage_figures/Before_more.png" width="800" height="400">

After using selenium to click all the **more** buttons that belong to reviews on the page, the whole review is visible and can be scraped.

<img src="webpage_figures/After_more.png" width="800" height="400">


In [None]:
def get_reviews(url):
    print(url)

    nextPage = True
    while nextPage:
        #Requests
        driver.get(url)
        time.sleep(1)
        #Click More button
        more = driver.find_elements_by_xpath("//span[@class='taLnk ulBlueLinks'][contains(text(),'More')]")
        # Click every "More" button so that each review is shown in full.
        for x in range(0,len(more)):
            try:
                driver.execute_script("arguments[0].click();", more[x])
                time.sleep(3)
            except:
                pass
            
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        #Store name
        storeName = soup.find('h1', class_='_3a1XQ88S').text
        #Reviews
        results = soup.find('div', class_='listContainer hide-more-mobile')
        try:
            reviews = results.find_all('div', class_='prw_rup prw_reviews_review_resp')
        except Exception:
            # No review container on this page; a bare "continue" would reload the same url forever
            reviews = []
        #Export to csv
        try:
            with open(pathToReviews, mode='a', encoding="utf-8") as trip_data:
                data_writer = csv.writer(trip_data, delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
                for review in reviews:
                    ratingDate = review.find('span', class_='ratingDate').get('title')
                    reviewHeader = review.find('span', class_='noQuotes').text
                    text_review = review.find('p', class_='partial_entry')
                    if len(text_review.contents) > 2:
                        reviewText = str(text_review.contents[0][:-3]) + ' ' + str(text_review.contents[1].text)
                    else:
                        reviewText = text_review.text
                    reviewerUsername = review.find('div', class_='info_text pointer_cursor')
                    reviewerUsername = reviewerUsername.select('div > div')[0].get_text(strip=True)
                    rating = review.find('div', class_='ui_column is-9').findChildren('span')
                    rating = str(rating[0]).split('_')[3].split('0')[0]
                    data_writer.writerow([storeName, reviewerUsername, ratingDate, reviewHeader, reviewText, rating])
        except:
            pass

        #Go to next page if exists
        try:
            unModifiedUrl = str(soup.find('div', class_ = 'prw_rup prw_common_responsive_pagination').find('a', class_='nav next ui_button primary', href=True).get('href'))
            # unModifiedUrl = str(soup.find('a', class_ = 'nav next ui_button primary',href=True)['href'])
            url = 'https://www.tripadvisor.com' + unModifiedUrl
        except:
            nextPage = False
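
The paging logic of `get_reviews` boils down to following "next" links until none is found. The sketch below isolates that loop and adds two safeguards that the function above does not have: a set of visited pages and a page limit. `iter_pages` is a hypothetical helper, and the dictionary stands in for the lookup of the next link on a real page.

```python
def iter_pages(start_url, fetch_next_url, max_pages=1000):
    """Yield page URLs by following 'next' links until none is found."""
    url = start_url
    seen = set()
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        yield url
        # fetch_next_url returns the next page's URL, or None on the last page
        url = fetch_next_url(url)


# Stand-in for a site with three review pages.
fake_next = {"page1": "page2", "page2": "page3", "page3": None}
pages = list(iter_pages("page1", fake_next.get))
print(pages)
```

Tracking visited pages means a page that links back to an earlier one ends the loop instead of repeating it.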


Loop through the restaurant URLs and get all reviews for each restaurant. Here `urls_left` is a pickled list of restaurant URLs loaded from `urls_left.txt`.

In [None]:
with open("urls_left.txt", "rb") as f:   # Unpickling
    urls_left = pickle.load(f)

driver_path = f'{os.getcwd()}/chromedriver' # Get driver to run selenium
driver = webdriver.Chrome(driver_path)

finished = []
bad_url = []
url_slice = urls_left[0:100]
i = 0
for url in url_slice:
    try:
        get_reviews(url)
        finished.append(url)
        print(i)
        i+=1
    except:
        bad_url.append(url)
        i+=1
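
Since the scraping was done in slices, it helps to know which URLs are still outstanding after each run. How `urls_left.txt` was actually maintained is not shown here, so the following is only a sketch of one possible approach: `remaining_urls` is a hypothetical helper, and the URL strings are placeholders.

```python
def remaining_urls(all_urls, finished):
    """Return the URLs that still need scraping, preserving their order."""
    done = set(finished)
    return [url for url in all_urls if url not in done]


all_urls = ["url_a", "url_b", "url_c", "url_d"]
finished = ["url_a", "url_c"]   # scraped without errors
bad_url = ["url_b"]             # raised an exception, so it stays in the queue
urls_left = remaining_urls(all_urls, finished)
print(urls_left)
```

The resulting list could then be pickled again so that the next run picks up where the previous one stopped.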

<a id='revinfo'></a>
# 4. Reviewer Info


After storing all the review data, we gathered all unique reviewers and looked at their profiles for additional information. Here **selenium** came to the rescue, as it was necessary to click buttons on the reviewer's own page. The information stored here is mainly intended to capture the connection between reviewers and restaurants. Tripadvisor has a community of reviewers who can follow each other, as on social platforms. That information is gathered along with the reviewer's total number of reviews and "upvotes". The hope is to shed light on the influence of specific reviewers and the value they could add to restaurants, and possibly to detect bad or fraudulent reviews with the data at hand. The most frequently available data is the location and the join date of the reviewer. This information is quite important, since a network can be created based on those attributes.


The layout of a reviewer's profile page looks like this:

<img src="webpage_figures/Reviewer.png" width="800" height="400">

We want to get several fields, but some of them are hidden. When we click on contributions, we get the following window:

<img src="webpage_figures/Reviewer_cont.png" width="800" height="400">

Subsequently, when clicking on followers or following, we see which people are involved.

<img src="webpage_figures/Reviewer_following.png" width="800" height="400">

In [None]:
# Get all the reviewer info: location, join date, review count, upvotes, followers and following.
def reviewerInfo(url):
    username = url
    full_url = f"https://www.tripadvisor.com/Profile/{url}"
    driver.get(full_url)
    time.sleep(1)

    # Get Intro info, location and join date
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    try:
        location = soup.find('span', class_ = "_2VknwlEe _3J15flPT default").text
    except:
        location = None

    try:
        joined = soup.find('span', class_ = "_1CdMKu4t").text
    except:
        joined = None


    all_links = soup.find_all('div', class_ = '_1aVEDY08')
    # link = driver.find_elements_by_xpath("//div[@class='nkw-3XeH']/div[1]/span[2]/a")

    # # Get the contributions info
    nrContributions = int(str(all_links[0].text).split()[1])
    if nrContributions > 0:
        link = driver.find_elements_by_xpath("//div[@class='nkw-3XeH']/div[1]/span[2]/a")
        driver.execute_script("arguments[0].click();", link[0])
        time.sleep(1)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        nrReviews = int(str(soup.find('span', class_ = 'ui_icon pencil-paper _1LSVmZLi').parent.text).split()[0])
        try:
            nrUpvotes = int(str(soup.find('span', class_ ='ui_icon thumbs-up _1LSVmZLi _3zmXi7gU').parent.text).split()[0])
        except:
            nrUpvotes = 0
        close = driver.find_elements_by_xpath("//div[@class='_2EFRp_bb _9Wi4Mpeb']")
        driver.execute_script("arguments[0].click();", close[0])
    else:
        nrReviews = 0
        nrUpvotes = 0

    # Get Followers
    nrFollowers = int(str(all_links[1].text).split()[1])
    if nrFollowers > 0:
        link = driver.find_elements_by_xpath("//div[@class='nkw-3XeH']/div[2]/span[2]/a")
        driver.execute_script("arguments[0].click();", link[0])
        time.sleep(1)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        followers = [word.text for word in soup.find('div', class_='_1caczhWN').find_all('span', class_='gf69u3Nd')]
        close = driver.find_elements_by_xpath("//div[@class='_2EFRp_bb _9Wi4Mpeb']")
        driver.execute_script("arguments[0].click();", close[0])
    else:
        followers = []

    # Get all following
    nrFollowing = int(str(all_links[2].text).split()[1])
    if nrFollowing > 0:
        link = driver.find_elements_by_xpath("//div[@class='nkw-3XeH']/div[3]/span[2]/a")
        driver.execute_script("arguments[0].click();", link[0])
        time.sleep(1)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        following = [word.text for word in soup.find('div', class_='_1caczhWN').find_all('span', class_='gf69u3Nd')]
        close = driver.find_elements_by_xpath("//div[@class='_2EFRp_bb _9Wi4Mpeb']")
        driver.execute_script("arguments[0].click();", close[0])
    else:
        following = []

    with open(pathtoReviewers, mode='a', encoding="utf-8") as reviewer_data:
        data_writer = csv.writer(reviewer_data, delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
        data_writer.writerow([username, location, joined, nrContributions,nrReviews, nrUpvotes, nrFollowers, followers, nrFollowing,following])
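
The counts in the function above are parsed with `int(text.split()[1])`, which fails if the number contains a thousands separator. A slightly more tolerant parser is sketched below. It assumes labels of the form "Contributions 1,234", which is an assumption about the page text, and it does not handle abbreviated counts such as "1.2K". `parse_count` is a hypothetical helper.

```python
import re


def parse_count(label_text):
    """Extract the first integer from a label, ignoring comma separators."""
    match = re.search(r"\d[\d,]*", label_text)
    if match is None:
        return 0  # no number in the label
    return int(match.group(0).replace(",", ""))


# Hypothetical label texts for illustration.
print(parse_count("Contributions 1,234"))
print(parse_count("Followers 12"))
print(parse_count("Following"))
```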

Get all the reviewers from our review data set. Here we need to merge all of our csv files, since we did many parallel runs to get the data.

In [None]:
reader = csv.reader(open("reviews1.csv"))

f = open("reviews_all.csv", "w")
writer = csv.writer(f)

for row in reader:
    writer.writerow(row)

for x in range(2,12):
    file = f"reviews{x}.csv"
    reader = csv.reader(open(file))
    next(reader)
    for row in reader:
        writer.writerow(row)
f.close()
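
The same merge can be written as a reusable function that closes every file it opens and keeps only the first header. This is a sketch under the assumption that all parts share the same columns; `merge_csv_files` is a hypothetical helper, and the demo creates two tiny files in a temporary directory with made-up rows.

```python
import csv
import os
import tempfile


def merge_csv_files(paths, out_path):
    """Concatenate csv files, keeping the header of the first file only."""
    rows_written = 0
    with open(out_path, "w", encoding="utf-8", newline="") as out:
        writer = csv.writer(out)
        for index, path in enumerate(paths):
            with open(path, encoding="utf-8", newline="") as src:
                reader = csv.reader(src)
                header = next(reader, None)
                if index == 0 and header is not None:
                    writer.writerow(header)
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Demo with two small, made-up part files.
tmp_dir = tempfile.mkdtemp()
parts = []
for number, name in enumerate(["reviewer_a", "reviewer_b"], start=1):
    part = os.path.join(tmp_dir, f"reviews{number}.csv")
    with open(part, "w", encoding="utf-8", newline="") as handle:
        csv.writer(handle).writerows(
            [["storeName", "reviewerUsername"], ["Example Diner", name]])
    parts.append(part)

merged_path = os.path.join(tmp_dir, "reviews_all.csv")
data_rows = merge_csv_files(parts, merged_path)
print(data_rows)
```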

In [7]:
df = pd.read_csv("all_reviews.csv")
df.head()

| Unnamed: 0 | storeName | reviewerUsername | ratingDate | reviewHeader | reviewText | rating |
|---|---|---|---|---|---|---|
| 0 | Maple Casual Dining | 918emmaf | December 5, 2020 | Exquisite | We visited Maple in Friday night and had a won... | 5 |
| 1 | Maple Casual Dining | hildurj2016 | November 19, 2020 | Perfect wedding dinner | Excellent food, drinks and service!! Me and my... | 5 |
| 2 | Maple Casual Dining | Judy B | October 27, 2020 | Beautifully Presented Food | I visited this restaurant on my first ever vis... | 5 |
| 3 | Maple Casual Dining | EldBjoern | October 18, 2020 | Very good food and very pleasant people | We ate dinner in their restaurant. The waiter ... | 5 |
| 4 | Maple Casual Dining | MacondoExpresss | October 13, 2020 | A lovely birthday dinner | Visited as a couple to celebrate my birthday. ... | 5 |


We only want to look at unique usernames, and we see that we have **54,541** unique reviewers in our data set.

In [19]:
all_reviewers = list(df.reviewerUsername.unique())
print(len(all_reviewers))

54541


In [20]:
# Save all the reviewers we want data about
with open('all_reviewers.txt', 'wb') as f:
    pickle.dump(all_reviewers, f)

Now we loop through each reviewer id and use our function to retrieve the data.

In [None]:
# Initialize the chrome driver with selenium
driver_path = f'{os.getcwd()}/chromedriver'
driver = webdriver.Chrome(driver_path)

with open("all_reviewers.txt", "rb") as f:   # The list was pickled, so unpickle it
    reviewers = pickle.load(f)

bad_url = []
reviewer_slice = reviewers[0:10000] # Take a slice of 10000 reviewers
for reviewer in reviewer_slice:
    try:
        reviewerInfo(reviewer)
    except:
        bad_url.append(reviewer)
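
Taking slices such as `reviewers[0:10000]` by hand can be replaced with a small generator that yields consecutive batches, for example one batch per run. This is a sketch only: `chunked` is a hypothetical helper, and the numbers are placeholders for reviewer usernames.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 25 placeholder reviewers split into batches of 10.
placeholder_reviewers = list(range(25))
batches = list(chunked(placeholder_reviewers, 10))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```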

#### Since this was done in many runs we now merge all csv files of reviewer info together:

In [5]:
reader = csv.reader(open("reviewer_info/reviewerInfo_1.csv"))

f = open("all_reviewer_info.csv", "w")
writer = csv.writer(f)

for row in reader:
    writer.writerow(row)

for x in range(2,6):
    file = f"reviewer_info/reviewerInfo_{x}.csv"
    reader = csv.reader(open(file))
    next(reader)
    for row in reader:
        writer.writerow(row)
f.close()

<a id='improv'></a>
# 5. Improvements

The total time needed to run all of the scraping was about 45 hours. You try your best to think of all exceptions and of all the data that needs to be stored, but sometimes you overlook something that later turns out to be very valuable or even essential for your work. In our case we found a flaw in our **get_reviews** function: the restaurant's individual ID is not stored anywhere, though it was easily available. As a result, reviews for restaurant chains cannot be attributed to a specific restaurant, since the restaurants share the same name. We moved on even though this mistake was found, since all the data had already been scraped, and we take it as a lesson to be more careful about uniquely identifying anything that could conflict with other data.
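
One way to obtain such an identifier is to take it from the restaurant URL that the scraper already has. The sketch below assumes that the URL contains a segment of the form `-d<digits>-` that is unique per restaurant; this is an assumption about the URL format, and both `restaurant_id_from_url` and the example URL are hypothetical.

```python
import re


def restaurant_id_from_url(url):
    """Return the digits of the '-d<digits>-' segment, or None if absent."""
    match = re.search(r"-d(\d+)-", url)
    return match.group(1) if match else None


# Made-up URL for illustration only.
example_url = ("https://www.tripadvisor.com/Restaurant_Review-g189541-"
               "d0000000-Reviews-Example-Copenhagen_Zealand.html")
print(restaurant_id_from_url(example_url))
```

Writing this value as an extra column next to `storeName` would let reviews of different branches of the same chain be told apart.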