# Webscraping TripAdvisor
## -- All recommended restaurants on TripAdvisor

### Modules

The modules we will be using for this version of web scraping are:

- **pandas**: Used later to build a DataFrame holding each recommended restaurant's name and site link after scraping TripAdvisor.

- **requests**: Sends HTTP requests so that we can retrieve the response data from each page.

- **BeautifulSoup**: After looking through several web scraping tutorials, I learned that BeautifulSoup is an extremely useful module for extracting data from HTML.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as soup

# Part 1: Data Preparation

## Step 1: Top 30 restaurants

### Get the Site Link with the Name

First, we use the requests and BeautifulSoup modules to fetch and parse the response from the website. Then we inspect the page with the browser's Developer Tools to find the elements that hold the information we want to scrape.

In [49]:
html_main = requests.get("https://www.tripadvisor.com/Restaurants-g32655-Los_Angeles_California.html")
bsobj_main = soup(html_main.content, "lxml")

In [51]:
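rank = []
restaurant_name = []
style = []
rate_score = []
restaurant_link = []
# Each restaurant card is a div with this auto-generated class name
for s1 in bsobj_main.find_all("div", class_ = "_1llCuDZj"):
    # Skip cards marked data-test="SL_list_item" (these appear to be sponsored listings);
    # .get() avoids a KeyError when the attribute is missing
    if s1.get("data-test") != "SL_list_item":
        # The title text looks like "1. n/naka": rank first, then the name
        title = s1.find("a", class_="_15_ydu6b").text.split(". ")
        rank.append(title[0])
        # Stored as a one-element list; it is flattened in Step 3
        restaurant_name.append(title[1:2])
        # The cuisine text comes before the "$" price markers
        style.append(s1.find("div", class_ = "MIajtJFg _1cBs8huC _3d9EnJpt").text.split("$")[0])

        for s2 in s1.find_all("div", class_ = "MIajtJFg _1cBs8huC"):
            # Restaurants without a rating have no svg element
            if s2.svg is None:
                rate_score.append("0 of 5 bubbles")
            else:
                rate_score.append(s2.svg["aria-label"])

        restaurant_link.append(s1.find("div", class_="wQjYiB7z").a["href"])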
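The title handling above splits on every ". ", so a name that itself contains ". " would be cut short. The following is a minimal sketch of a sturdier split that breaks only at the first ". ". The helper name `split_rank_and_name` and the second sample title are hypothetical and used only for illustration.

```python
def split_rank_and_name(title):
    """Split a listing title such as "1. n/naka" into (rank, name)."""
    # partition() splits only at the first ". ", so later dots stay in the name
    rank, separator, name = title.partition(". ")
    if not separator or not rank.isdigit():
        # No leading rank found: treat the whole title as the name
        return None, title.strip()
    return rank, name.strip()


print(split_rank_and_name("1. n/naka"))        # ('1', 'n/naka')
print(split_rank_and_name("12. St. Example"))  # ('12', 'St. Example')
```

With the original `split(". ")[1:2]`, the second title would yield only `['St']`.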

## Step 2: The rest of the restaurants - 13460 recommended restaurants in Los Angeles

Using the same approach as in Step 1, we can collect the restaurant information from the remaining result pages on TripAdvisor.

In [57]:
top_15000 = list(range(30, 15000, 30))
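To make the paging explicit, here is a small sketch of how these offsets map to result-page URLs. The URL pattern is the one used in the next cell; the names `PAGE_URL` and `page_url` are hypothetical helpers.

```python
# URL pattern taken from the scraping loop; {offset} is the index of the first result on the page
PAGE_URL = (
    "https://www.tripadvisor.com/RestaurantSearch-g32655-oa{offset}"
    "-a_geobroaden.true-Los_Angeles_California.html"
)


def page_url(offset):
    """Build the URL of the result page that starts at the given offset."""
    return PAGE_URL.format(offset=offset)


offsets = list(range(30, 15000, 30))
print(len(offsets))              # 499
print(offsets[0], offsets[-1])   # 30 14970
print(page_url(offsets[0]))
```

The range starts at 30 because the first 30 restaurants were already collected in Step 1.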

In [58]:
for num in top_15000:
    # "oa<num>" in the URL is the result offset; each page lists 30 restaurants
    html_main = requests.get("https://www.tripadvisor.com/RestaurantSearch-g32655-oa" + str(num) + "-a_geobroaden.true-Los_Angeles_California.html")
    bsobj_main = soup(html_main.content, "lxml")

    for s1 in bsobj_main.find_all("div", class_ = "_1llCuDZj"):
        # Skip cards marked data-test="SL_list_item"; .get() avoids a KeyError when the attribute is missing
        if s1.get("data-test") != "SL_list_item":
            # The title text looks like "31. Name": rank first, then the name
            title = s1.find("a", class_="_15_ydu6b").text.split(". ")
            rank.append(title[0])
            # Stored as a one-element list; it is flattened in Step 3
            restaurant_name.append(title[1:2])
            style.append(s1.find("div", class_ = "MIajtJFg _1cBs8huC _3d9EnJpt").text.split("$")[0])

            for s2 in s1.find_all("div", class_ = "MIajtJFg _1cBs8huC"):
                # Restaurants without a rating have no svg element
                if s2.svg is None:
                    rate_score.append("0 of 5 bubbles")
                else:
                    rate_score.append(s2.svg["aria-label"])

            restaurant_link.append(s1.find("div", class_="wQjYiB7z").a["href"])
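A loop over roughly 500 pages sends many requests in quick succession and stops at the first network error. One possible refinement, sketched below, is to pause between requests and skip pages that fail. This is not part of the original notebook: `crawl`, its `fetch` parameter, and the one-second default delay are assumptions made for illustration.

```python
import time


def crawl(urls, fetch, delay_seconds=1.0):
    """Fetch each URL in order, pausing between requests and skipping failures."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # be polite to the server
        try:
            pages.append(fetch(url))
        except Exception as error:
            print(f"Skipping {url}: {error}")
    return pages


# Offline demonstration with a stand-in fetch function
demo = crawl(["page-1", "page-2"], fetch=lambda url: f"<html>{url}</html>", delay_seconds=0)
print(len(demo))  # 2
```

In the notebook, `fetch` could be something like `lambda url: requests.get(url, timeout=10).content`, and each returned page would then be parsed with BeautifulSoup as before.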

## Step 3: Make a DataFrame from all the scraped information

In [59]:
restaurant_name_update = []
for name in restaurant_name:
    restaurant_name_update.append(name[0])

In [60]:
restaurant_link_update = []
for site in restaurant_link:
    restaurant_link_update.append("https://www.tripadvisor.com/" + site)
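The scraped `href` values appear to start with a slash already, which would explain the double slash in the links of the resulting table. Such URLs usually still resolve, but `urllib.parse.urljoin` from the standard library is one way to build clean absolute links. A minimal sketch follows; the helper name `absolute_link` and the sample path are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.tripadvisor.com/"


def absolute_link(href):
    """Join a site-relative href onto the base URL without doubling the slash."""
    return urljoin(BASE_URL, href)


# Hypothetical path, for illustration only
print(absolute_link("/Restaurant_Review-example.html"))
# https://www.tripadvisor.com/Restaurant_Review-example.html
```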

### There are 13460 recommended restaurants in Los Angeles! Results after the first 13460 are restaurants near Los Angeles, so we leave them out.

In [64]:
data = {"Rank": rank[:13460], "Restaurant Name": restaurant_name_update[:13460], "Style": style[:13460], "Rate": rate_score[:13460], "Site Link": restaurant_link_update[:13460]}
df = pd.DataFrame.from_dict(data)
df

| | Rank | Restaurant Name | Style | Rate | Site Link |
|---|---|---|---|---|---|
| 0 | 1 | n/naka | Japanese, Sushi | 5.0 of 5 bubbles | https://www.tripadvisor.com//Restaurant_Review... |
| 1 | 2 | Raffaello Ristorante | Italian | 4.5 of 5 bubbles | https://www.tripadvisor.com//Restaurant_Review... |
| 2 | 3 | Brent's Delicatessen & Restaurant | American, Deli | 4.5 of 5 bubbles | https://www.tripadvisor.com//Restaurant_Review... |
| 3 | 4 | Providence | Seafood | 4.5 of 5 bubbles | https://www.tripadvisor.com//Restaurant_Review... |
| 4 | 5 | Angelini Osteria | Italian, Sicilian | 4.5 of 5 bubbles | https://www.tripadvisor.com//Restaurant_Review... |
| ... | ... | ... | ... | ... | ... |
| 13455 | 13456 | Sweet Bakery Grocery & Kabob Factory | | 0 of 5 bubbles | https://www.tripadvisor.com//Restaurant_Review... |
| 13456 | 13457 | SanSai Japanese Grill | | 0 of 5 bubbles | https://www.tripadvisor.com//Restaurant_Review... |
| 13457 | 13458 | Winchell's Donut House | American | 0 of 5 bubbles | https://www.tripadvisor.com//Restaurant_Review... |
| 13458 | 13459 | Winchell's Donut House | American | 0 of 5 bubbles | https://www.tripadvisor.com//Restaurant_Review... |
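The `Rate` column holds text labels such as "4.5 of 5 bubbles". If a numeric rating is needed for sorting or filtering, the leading number can be extracted as in the sketch below. The helper `rating_to_float` is hypothetical and assumes every label follows the "<number> of 5 bubbles" pattern seen above.

```python
def rating_to_float(label):
    """Convert a label such as "4.5 of 5 bubbles" to the number 4.5."""
    return float(label.split(" of ")[0])


print(rating_to_float("4.5 of 5 bubbles"))  # 4.5
print(rating_to_float("0 of 5 bubbles"))    # 0.0
```

Note that "0 of 5 bubbles" is the placeholder the scraping code assigns to restaurants without a rating, so a value of 0.0 means "unrated" rather than a genuinely low score.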


## Step 4: CSV File -- There are 13460 recommended restaurants on TripAdvisor in total!

We have finished scraping all the recommended restaurants on TripAdvisor, so we save them to a CSV file for convenience.

In [65]:
df.to_csv("13460_restaurant.csv")

# Part 2: Web App Simulation

## Step 1: Show all the restaurants to the users for selection

In [67]:
restaurant_name_update[:13460]

['n/naka',
 'Raffaello Ristorante',
 "Brent's Delicatessen & Restaurant",
 'Providence',
 'Angelini Osteria',
 'Maccheroni Republic',
 'Sushi Gen',
 'Cafe Gratitude',
 "Langer's",
 'Toast Bakery Cafe',
 'Karl Strauss Brewing Company',
 'Flake',
 'Parkway Grill',
 'Magic Castle',
 'Nickel Diner',
 'Redbird',
 'Pampas Grill',
 'Genwa Korean BBQ',
 'Lemonade',
 'The Boiling Crab',
 'The Luggage Room Pizzeria',
 'Craft Los Angeles',
 "Raffi's Place Restaurant",
 "Foxy's Restaurant",
 'Aroma Coffee & Tea',
 'SUGARFISH by sushi nozawa',
 'Off Vine',
 'Din Tai Fung',
 'The Griddle Cafe',
 "Ca' Del Sole",
 "Cassell's Hamburgers",
 'The Factory Kitchen',
 'Perch',
 'Cleo Hollywood',
 'Republique',
 "Gus's Barbecue - South Pasadena",
 "Gale's Italian Restaurant and Bar",
 "Aliki's Greek Taverna",
 'Musso & Frank Grill',
 "Truxton's American Bistro",
 'Tatsu Ramen',
 'Baco Mercat',
 'Bossa Nova',
 'Ayara Thai Cuisine',
 'Sushi A Go Go',
 'Water Grill',
 "The Butcher's Daughter",
 'Crossroads',
 ...]

In [43]:
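def visit_restaurant(df, want_to_go_name):
    # Several restaurant names can be entered at once, separated by "."
    each_name = [name.strip() for name in want_to_go_name.split(".") if name.strip()]
    columns = ["Restaurant Name", "Site Link"]

    # Collect the matching rows for each requested name, in input order
    matches = [df.loc[df["Restaurant Name"] == name, columns] for name in each_name]

    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    visit_html = pd.concat(matches) if matches else df[columns].iloc[0:0]

    return visit_html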

## Step 2: Ask for an input from the users

In [45]:
want_to_go_name = input("Which restaurant do you want to go to? ")
visit_html = visit_restaurant(df, want_to_go_name)

Which restaurant do you want to go to? Yamashiro Hollywood


In [46]:
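def find_location(visit_html):

    location = []

    for link in visit_html["Site Link"]:
        html = requests.get(link)
        bsobj = soup(html.content, "lxml")

        # Start from empty strings so a page without address data cannot raise a NameError
        # or silently reuse the previous restaurant's values
        street_name = city_name = zipcode = ""

        # The address is embedded in a JSON-LD <script> block
        for loc in bsobj.find_all("script", type = "application/ld+json"):
            # loc.string can be None, so check it before searching
            if loc.string and "streetAddress" in loc.string:
                for street in loc.string.split("{"):
                    if "streetAddress" in street:
                        for value in street.split(","):
                            # Each piece looks like "key":"value"; [1:-1] drops the first and last character (the quotes)
                            if "streetAddress" in value:
                                street_name = value.split(":")[1][1:-1] + ", "

                            if "addressLocality" in value:
                                city_name = value.split(":")[1][1:-1] + ", "

                            if "postalCode" in value:
                                # The state is hard-coded because every restaurant in this dataset is in California
                                zipcode = "CA " + value.split(":")[1][1:-1]

        location_info = street_name + city_name + zipcode
        location.append(location_info)

    return location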
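Splitting the JSON-LD text on "{", "," and ":" works for the example here, but it would break on a street address that contains a comma or colon. A possible alternative, sketched below, is to parse the script content with the standard `json` module and search it for the address fields. The helper names and the sample JSON layout are assumptions for illustration; the actual structure of TripAdvisor's JSON-LD is not confirmed here, which is why the sketch searches recursively instead of relying on a fixed path.

```python
import json


def find_address(node):
    """Return the first dict containing "streetAddress", searching nested dicts and lists."""
    if isinstance(node, dict):
        if "streetAddress" in node:
            return node
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_address(child)
        if found is not None:
            return found
    return None


def format_address(raw_json):
    """Build "street, city, CA zip" from a JSON-LD string, or return None if no address is found."""
    address = find_address(json.loads(raw_json))
    if address is None:
        return None
    street = address.get("streetAddress", "")
    city = address.get("addressLocality", "")
    zipcode = address.get("postalCode", "")
    # "CA" is hard-coded, as in find_location above
    return f"{street}, {city}, CA {zipcode}"


# Hypothetical JSON-LD layout, filled with the address values from the output shown below
sample = json.dumps({
    "@type": "Restaurant",
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "1999 N Sycamore Ave",
        "addressLocality": "Los Angeles",
        "postalCode": "90068-3782",
    },
})
print(format_address(sample))
# 1999 N Sycamore Ave, Los Angeles, CA 90068-3782
```

Inside `find_location`, `format_address(loc.string)` could stand in for the nested string splitting, provided the script content is valid JSON.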

In [47]:
location = find_location(visit_html)

In [48]:
location

['1999 N Sycamore Ave, Los Angeles, CA 90068-3782']