# Your Personal Trip Advisor
#### Recommending attractions in Yosemite National Park based on personal preferences, by referencing reviews written by users on Trip Advisor.

## Part 1 of 4

**Objective**
In this notebook, the aim is to acquire data by scraping Trip Advisor reviews of attractions within Yosemite National Park, using Selenium, Beautiful Soup, and the requests library.



In [1]:
from bs4 import BeautifulSoup
from selenium import webdriver
import pickle
import time
import requests
import os
import pandas as pd
import regex as re
import numpy as np
import random

#Custom Python module written to help with scraping
import scraping

%load_ext autoreload
%autoreload 2

Trip Advisor displays only part of each review by default; a 'Read More' button reveals the full text.  
Hence, *Selenium* can be used to open the page and click the 'Read More' button, after which the full text can be scraped.

In [34]:
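from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

chromedriver = "E:/chromedriver_win32/chromedriver" # path to the chromedriver executable
# Selenium 4 syntax; Selenium 3 accepted the path directly: webdriver.Chrome(chromedriver)
driver = webdriver.Chrome(service=Service(chromedriver))
url = "https://www.tripadvisor.in/Attraction_Review-g61000-d139187-Reviews-or5-Glacier_Point-Yosemite_National_Park_California.html#REVIEWS"
driver.get(url)
time.sleep(10) # give the page time to load before looking for the button
# Class name of the 'Read More' button at the time of scraping; generated class names like this may change
button = driver.find_element(By.CLASS_NAME, "_3aUwQbpX")
button.click()
ta_reviews_page_soup = BeautifulSoup(driver.page_source,'html5lib')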

After some testing, it was found that Selenium needs roughly 10 seconds per page to load it and click the 'Read More' button. Each listing page contains 5 reviews, but every review also has its own page.  
Hence, it was decided to collect the links to the individual review pages and obtain the full review text from there, parsed with Beautiful Soup. The downside is that 5 additional links need to be opened and scraped for every listing page (one for each of the 5 reviews on that page).
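
If the review links are collected as site-relative paths (as the `review_link` column in this notebook suggests), they need to be joined to the site root before they can be requested. A minimal sketch using only the standard library; the helper name, the base URL, and the example path are illustrative assumptions, not values taken from the scraped data:

```python
from urllib.parse import urljoin

# Assumed site root, matching the domain of the listing URLs in this notebook
BASE_URL = "https://www.tripadvisor.in"


def absolute_review_url(review_link, base_url=BASE_URL):
    """Turn a site-relative review link into a full URL (hypothetical helper)."""
    return urljoin(base_url, review_link)


# Made-up path for illustration only
example_link = "/ShowUserReviews-g61000-d139187-r000000000-Example.html"
full_url = absolute_review_url(example_link)
print(full_url)
```

The resulting URL could then be passed to `requests.get` and the response parsed with Beautiful Soup, in the same way as the listing pages.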

In [3]:
# Col names for the dataframe where the acquired data will be stored

col_names = ['attraction_name', 
 'attraction_id',
 'user_name',
 'user_profile_link',
 'review_date',
 'helpful_votes',
 'rating',
 'review_link',
 'review_text',
 'review_title',
 'experience_date']

df_reviews = pd.DataFrame(columns=col_names)

In [4]:
attraction_url_list = ['https://www.tripadvisor.in/Attraction_Review-g61000-d139187-Reviews-or{}-Glacier_Point-Yosemite_National_Park_California.html',
    'https://www.tripadvisor.in/Attraction_Review-g61000-d489919-Reviews-or{}-Yosemite_Valley-Yosemite_National_Park_California.html',
'https://www.tripadvisor.in/Attraction_Review-g61000-d592681-Reviews-or{}-Mariposa_Grove_of_Giant_Sequoias-Yosemite_National_Park_California.html',
'https://www.tripadvisor.in/Attraction_Review-g61000-d139183-or{}-Reviews-Half_Dome-Yosemite_National_Park_California.html',
'https://www.tripadvisor.in/Attraction_Review-g61000-d593212-or{}-Reviews-Tunnel_View-Yosemite_National_Park_California.html',
'https://www.tripadvisor.in/Attraction_Review-g61000-d146103-or{}-Reviews-Tioga_Pass-Yosemite_National_Park_California.html',
'https://www.tripadvisor.in/Attraction_Review-g61000-d139182-or{}-Reviews-El_Capitan-Yosemite_National_Park_California.html',
'https://www.tripadvisor.in/Attraction_Review-g61000-d483482-or{}-Reviews-Mist_Trail-Yosemite_National_Park_California.html',
'https://www.tripadvisor.in/Attraction_Review-g61000-d483476-or{}-Reviews-Yosemite_Falls-Yosemite_National_Park_California.html',
'https://www.tripadvisor.in/Attraction_Review-g61000-d483481-or{}-Reviews-Vernal_Fall-Yosemite_National_Park_California.html']

attraction_name_list = [re.sub('_', ' ', name.split('-')[-2]) for name in attraction_url_list]

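# Number of reviews to scrape per attraction.
# Note: only 9 values are given for 10 URLs, so the scraping loop below covers the first 9 attractions.
attraction_numbers = [2000, 1861, 1472, 2000, 1159, 1362, 1020, 2000, 1383]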

attraction_id_list = ['-'.join(url.split('-')[1:3]) for url in attraction_url_list]

In [5]:
# Check that the attraction names were parsed correctly
attraction_name_list

['Glacier Point',
 'Yosemite Valley',
 'Mariposa Grove of Giant Sequoias',
 'Half Dome',
 'Tunnel View',
 'Tioga Pass',
 'El Capitan',
 'Mist Trail',
 'Yosemite Falls',
 'Vernal Fall']
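
The name and ID lists above rely on the hyphen-separated layout of the attraction URLs. As a minimal sketch of that logic applied to a single URL (the helper name `parse_attraction_url` is hypothetical and not part of `scraping.py`):

```python
def parse_attraction_url(url_template):
    """Split an attraction URL template into (name, id).

    Hypothetical helper for illustration; it mirrors the list
    comprehensions used in this notebook.
    """
    parts = url_template.split('-')
    # The second-to-last hyphen-separated piece is the attraction name,
    # with underscores standing in for spaces.
    name = parts[-2].replace('_', ' ')
    # Pieces 1 and 2 together form the id used in this notebook.
    attraction_id = '-'.join(parts[1:3])
    return name, attraction_id


example = ('https://www.tripadvisor.in/Attraction_Review-g61000-d139187-'
           'Reviews-or{}-Glacier_Point-Yosemite_National_Park_California.html')
name, attraction_id = parse_attraction_url(example)
print(name, attraction_id)
```

Because only the second-to-last piece and pieces 1 and 2 are used, the parsing gives the same result whether the `or{}` placeholder comes before or after `Reviews` in the URL.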

In [9]:
# Read in the aggregated data stored so far; the scraping was done iteratively over several runs
df_reviews_total = pd.read_csv('../Data/attraction_point_reviews.csv')
df_reviews_total.sample(4)

| | attraction_name | attraction_id | user_name | user_profile_link | review_date | helpful_votes | rating | review_link | review_text | review_title | experience_date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8967 | Yosemite Falls | g61000-d483476 | David W | /Profile/davidwR3016WA | Aug 2014 | 17.0 | 3 | /ShowUserReviews-g61000-d483476-r222681344-Yos... | Ok, we are dumb. We jumped on the free shuttle... | The falls were dry.....no water | August 2014 |
| 6123 | Tioga Pass | g61000-d146103 | Linda K | /Profile/D2977RQlindak | Oct 2015 | 1.0 | 2 | /ShowUserReviews-g61000-d146103-r315869363-Tio... | Tioga Pass is undeniably beautiful - we drove ... | Make sure you don't mind heights before drivin... | September 2015 |
| 6327 | Tioga Pass | g61000-d146103 | Tracy H | /Profile/TracyH785 | Jun 2014 | 76.0 | 5 | /ShowUserReviews-g61000-d146103-r210958119-Tio... | A beautiful drive from the east entrance of Yo... | Beautiful! | June 2014 |
| 1832 | Glacier Point | g61000-d139187 | Eduardo284 | /Profile/Eduardo284 | Aug 2015 | 106.0 | 5 | /ShowUserReviews-g61000-d139187-r296659542-Gla... | this sightseeing is long but the best in my op... | Most recommended sightseeing | June 2015 |


The helper functions for scraping used below can be found in the Python file [scraping.py](scraping.py).

In [10]:
# The following loop scrapes the reviews for each attraction in the list defined above and stores the data in a dataframe.
# The dataframe is first written to an individual csv file (one per attraction) and then appended to the
# overall file.

for i in range(len(attraction_numbers)):

    url = attraction_url_list[i]
    point_name = attraction_name_list[i]
    point_id = attraction_id_list[i]
    n = attraction_numbers[i]
    df_reviews = pd.DataFrame(columns=col_names) 

    # Each listing page holds 5 reviews, so the page offset advances in steps of 5
    for j in range(0, n+1, 5):

        page_url = url.format(j)

        try:
            response = requests.get(page_url)
            if response.status_code == 200:

                soup = BeautifulSoup(response.text, 'html5lib')
                reviews_list = scraping.ta_attraction_reviews_parser(soup)
                
                if reviews_list:
                    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
                    # (assumes the parser returns a list of records, one per review)
                    df_reviews = pd.concat([df_reviews, pd.DataFrame(reviews_list)], ignore_index = True)

                    df_reviews['attraction_name'] = point_name
                    df_reviews['attraction_id'] = point_id

                    # Save after every page so that progress is kept if the run is interrupted
                    df_reviews.to_csv(f"../Data/attraction_point_reviews_{point_name}.csv" , index = False)

                else:
                    print(f"Reviews List came up empty on url {page_url}")

            else:
                print(f"status code returned is {response.status_code} for page {page_url}")
        
        except Exception as err_message:
            print(f"For page {page_url}, received following error message: {err_message}")


    df_reviews.to_csv(f"../Data/attraction_point_reviews_{point_name}.csv" , index = False)

    df_reviews_total = pd.concat([df_reviews_total, df_reviews])
    df_reviews_total.to_csv("../Data/attraction_point_reviews.csv", index = False)

df_reviews_total.tail()

| | attraction_name | attraction_id | user_name | user_profile_link | review_date | helpful_votes | rating | review_link | review_text | review_title | experience_date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10720 | Vernal Fall | g61000-d483481 | Fairport Travelers | /Profile/Clarkvara | Jul 2008 | 560.0 | 4 | /ShowUserReviews-g61000-d483481-r17854214-Vern... | Vernal Falls is sort of like the first leg of ... | Nice not too long hike | |
| 10721 | Vernal Fall | g61000-d483481 | CAtravelfamily | /Profile/CAtravelfamily | Jun 2008 | 16.0 | 5 | /ShowUserReviews-g61000-d483481-r17268549-Vern... | Whew, it was a tough climb at times, but once ... | Worth the effort! | |
| 10722 | Vernal Fall | g61000-d483481 | doodlebugakj | /Profile/doodlebugakj | Jul 2007 | | 5 | /ShowUserReviews-g61000-d483481-r8255121-Verna... | This was a really fun hike but when we came ar... | Wow that was a lot of stairs! | |
| 10723 | Vernal Fall | g61000-d483481 | Jase2153 | /Profile/Jase2153 | Sep 2005 | 9.0 | 5 | /ShowUserReviews-g61000-d483481-r3910762-Verna... | I was visiting Yosemite from Australia and wen... | Worth The Trip | |
| 10724 | Vernal Fall | g61000-d483481 | booradley2 | /Profile/booradley2 | Sep 2004 | 214.0 | 5 | /ShowUserReviews-g61000-d483481-r2512847-Verna... | I've never been especially enthusiastic about ... | Do Not Miss Vernal Fall | |
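
The scraping loop steps through the listing pages by filling the `or{}` placeholder in each URL with a review offset. A minimal sketch of that pagination, assuming 5 reviews per listing page as noted earlier; `page_urls` is a hypothetical helper and the short template is a placeholder, not a real attraction URL:

```python
REVIEWS_PER_PAGE = 5  # each listing page showed 5 reviews at the time of scraping


def page_urls(url_template, n_reviews, step=REVIEWS_PER_PAGE):
    """Return the listing-page URLs covering the first n_reviews reviews.

    Offsets run 0, 5, 10, ... and stop before n_reviews, so no page
    beyond the last requested review is generated.
    """
    return [url_template.format(offset) for offset in range(0, n_reviews, step)]


template = "https://example.com/Reviews-or{}-Some_Attraction.html"  # placeholder
urls = page_urls(template, 12)
print(urls)
```

Note that the loop in this notebook uses `range(0, n+1, 5)`, which requests one additional page whenever `n` is a multiple of 5; the sketch stops at `n` instead, which is a design choice rather than the notebook's behaviour.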


In [11]:
# Combine the data stored in the individual attraction csv files into a single dataframe
df_list = []
for i in range(len(attraction_name_list)):
    name = attraction_name_list[i]
    df_list.append(pd.read_csv(f"../Data/attraction_point_reviews_{name}.csv"))

new_df = pd.concat(df_list, ignore_index = True)

Now that we have the necessary data, we can move on to preprocessing the text using various NLP libraries; the steps can be found in [2-NLP_Preprocessing](2-NLP_Preprocessing.ipynb).

**Note for Future Steps**  
An additional script can be written to scrape the list of all attractions in Yosemite National Park, which can then be used to scrape the reviews for each of these attractions, making the corpus more comprehensive.  

This can be taken a step further by scraping the list of all National Parks. Attractions in one park could then be recommended based on a user's likes and dislikes from a previous park visit.  