### Web scraping from Tripadvisors' pages to extract info about restaurants (Beautifulsoup)

Different attributes of restaurants are extracted from the page of the most popular restaurants in Toronto anf from individual web page of these restaurants. These attributes are then put into a pandas dataframe. Beautifulsoup is used.

**Import libraries**

In [1]:
import requests
import numpy as np
from bs4 import BeautifulSoup
import time
import pandas as pd

**Extract info from a website**

In [2]:
url = "https://www.tripadvisor.com/Restaurants-g155019-Toronto_Ontario.html"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

r = requests.get(url, headers=headers)
html_doc = r.content

soup = BeautifulSoup(html_doc, 'html.parser')
elements = soup.findAll('a', attrs={'class': "property_title"})

**Extract names of restaurants (30 in total)** <br>
_Tag_: 'a' <br>
_Attribute_: 'class': 'property_title'

In [3]:
elements = soup.findAll('a', attrs={'class': 'property_title'})
restaurants=[]
for el in elements:
    restaurants.append(el.text.strip("\n"))

In [4]:
restaurants[:3]

['STK Toronto', 'ALO RESTAURANT', 'Scaramouche Restaurant']

**Extract number of reviews** <br>
_Tag_: 'div' <br>
_Attribute_: 'class': 'rating rebrand'

In [5]:
reviews=[]
reviews_number = soup.findAll('div', attrs={'class': "rating rebrand"})
for rew in reviews_number:
    reviews.append(rew.contents[3].text.strip("\n"))

In [6]:
reviews[:3]

['350 reviews ', '453 reviews ', '1,388 reviews ']

**Extract ratings** <br>
_Tag_: 'div' <br>
_Attribute_: 'class': 'rating rebrand'

In [7]:
ratings=[]
ratings_ = soup.findAll('div', attrs={'class': "rating rebrand"})

for ratg in ratings_:
    ratings.append(ratg.contents[1]['alt'].strip("\n"))

In [8]:
ratings[:3]

['4.5 of 5 bubbles', '4.5 of 5 bubbles', '4.5 of 5 bubbles']

**Extract prices** <br>
_Tag_: 'span' <br>
_Attribute_: 'class': 'item price'

In [9]:
prices=[]
prices_ = soup.findAll('span', attrs={'class': "item price"})

for p in prices_:
    prices.append(p.text)

Replace symbols with values

In [10]:
for n, i in enumerate(prices):
    if i == "$":
        prices[n] = "Cheap"
    elif i == "$$ - $$$":
        prices[n] = "Medium range"
    elif i == "$$$$":
        prices[n] = "Expensive"

In [11]:
prices[:3]

['Medium range', 'Expensive', 'Expensive']

**Extract cuisines of restaurants** <br>
_Tag_: 'div' <br>
_Attribute_: 'class': 'cuisines' <br>
Info is extracted through "children"

In [12]:
cuisines_ = soup.findAll('div', attrs={'class': "cuisines"})
cuisines=[]
for c in cuisines_:
    children = c.findChildren("a" , recursive=False)
    lst_temp=[]
    for child in children:
        lst_temp.append(child.text)
    cuisines.append(lst_temp)

In [13]:
cuisines[1] #single restaurant

['French',
 'European',
 'Vegetarian Friendly',
 'Vegan Options',
 'Gluten Free Options']

**Web links of restaurants to extract info from individual web pages of restaurants**

In [14]:
links_ = soup.findAll('a', attrs={'class': "property_title"})
links = []
for l in links_:
    links.append(l['href'])

**All restaraunts are processed via loop** <br> 
Different attributes are extracted and put into lists <br>
Listst are:

In [15]:
addresses = []
locations = []
countries = []
phone_numbs = []
ratings_all = []
details_list = []
reviews_full=[]

In [16]:
for link in links:
    time.sleep(5)
    url_r = "https://www.tripadvisor.com" + str(link)
    #print("Processing: ", link)
    r_r = requests.get(url_r, headers=headers)    
    html_doc_r = r_r.content
    soup_r = BeautifulSoup(html_doc_r, 'html.parser')

##/##/##/##/##/##/##/##/##/##/##/##/##/
#address is extracted
    address_ = soup_r.find('span', attrs={'class': "street-address"})
    addresses.append(address_.text)
    

##/##/##/##/##/##/##/##/##/##/##/##/##/
#locations
    locations_ = soup_r.find('span', attrs={'class': "locality"})
    locations.append(locations_.text[:-2]) #exclude last comma



##/##/##/##/##/##/##/##/##/##/##/##/##/
#country is extracted
    country_ = soup_r.find('span', attrs={'class': "country-name"})
    countries.append(country_.text.strip("\n"))



##/##/##/##/##/##/##/##/##/##/##/##/##/
#phone number is extracted

    phone = soup_r.find('div', attrs={'class': "blEntry phone"})
    phone_numbs.append(phone.text.strip("\n"))

##/##/##/##/##/##/##/##/##/##/##/##/##/
#ratings are extracted
    ratings_names = []
    ratings_numbs = []
    ratings_name = soup_r.findAll('div', attrs={'class': "wrap row part "})
    for r in ratings_name:
        ratings_names.append(r.span['alt'])
        
    ratings_numb = soup_r.findAll('div', attrs={'class': "label part "})
    for r_n in ratings_numb:
        ratings_numbs.append(r_n.text.strip())
        
    ratings_dict = {}
    for i in range(len(ratings_names)):
        ratings_dict[ratings_numbs[i]] = ratings_names[i]
    
    ratings_all.append(ratings_dict)

##/##/##/##/##/##/##/##/##/##/##/##/##/
#details (different features of a resturant) are extracted

    details = soup_r.findAll('div', attrs={'id': "RESTAURANT_DETAILS"})
    for d in details:
        rest_det = str(d.contents[3])
    
    soup_det = BeautifulSoup(rest_det, 'html.parser')
    
    ttls = []
    cont = []
    
    details_ = soup_det.findAll('div', attrs={'class': "title"})
    for det in details_:
        ttls.append(det.text.strip())
    
    contents_ = soup_det.findAll('div', attrs={'class': "content"})
    for cn in contents_:
        cont.append(cn.text.strip())
        
        
    detail_dict = {}
    for i in range(len(ttls)):
        detail_dict[ttls[i]] = cont[i]
            
    details_list.append(detail_dict)
    
##/##/##/##/##/##/##/##/##/##/##/##/##/
#long reviews
    reviews_temp=[]
    reviewsf_ = soup_r.findAll('p', attrs={'class': "partial_entry"})
    for rev in reviewsf_:
        reviews_temp.append(rev.text)
    reviews_full.append(reviews_temp)


**Create a table(dataframe) from attributes extracted before the loop**

In [17]:
df = pd.DataFrame(
    {'restaurant_name': restaurants,
     'adress': addresses,
     'country': countries,
     'phone_number': phone_numbs,
     'review': reviews,
     'overall_rating': ratings,
     'price': prices,
    })


In [18]:
df.head()

Unnamed: 0,restaurant_name,adress,country,phone_number,review,overall_rating,price
0,STK Toronto,153 Yorkville Ave,Canada,+1 416-613-9660,350 reviews,4.5 of 5 bubbles,Medium range
1,ALO RESTAURANT,163 Spadina Ave,Canada,+1 416-260-2222,453 reviews,4.5 of 5 bubbles,Expensive
2,Scaramouche Restaurant,1 Benvenuto Pl,Canada,+1 416-961-8011,"1,388 reviews",4.5 of 5 bubbles,Expensive
3,New Orleans Seafood & Steakhouse,267 Scarlett Rd,Canada,+1 416-766-7001,209 reviews,4.5 of 5 bubbles,Medium range
4,Richmond Station,1 Richmond St. West,Canada,+1 647-748-1444,"1,825 reviews",4.5 of 5 bubbles,Medium range


**Convert cuisines python list to a list with commas, put it inside a table**

In [19]:
cuisines_un=[]
for i in range(len(cuisines)):
    cuisines_un.append(",".join(cuisines[i]))

df['cuisines'] = cuisines_un

**Extract ratings of individual pieces of ratings** <br>
Some restaurants do not have "atmosphere", so info is put into a table via "try...except"

In [20]:
df['rating_food'] = np.nan
df['rating_service'] = np.nan
df['rating_atmosphere'] = np.nan
df['rating_value'] = np.nan

for i in range(30):
    try:
        df.loc[i,'rating_food'] = ratings_all[i]['Food']
    except KeyError:
         continue
for i in range(30):
    try:
        df.loc[i,'rating_service'] = ratings_all[i]['Service']
    except KeyError:
         continue
for i in range(30):
    try:
        df.loc[i,'rating_atmosphere'] = ratings_all[i]['Atmosphere']
    except KeyError:
         continue
for i in range(30):
    try:
        df.loc[i,'rating_value'] = ratings_all[i]['Value']
    except KeyError:
         continue

**Remove "bubbles" from ratings**

In [21]:
for i in range(30):
    df.loc[i,'rating_food'] = df.loc[i,'rating_food'].replace('bubbles', '')
    df.loc[i,'rating_service'] = df.loc[i,'rating_food'].replace('bubbles', '')
    df.loc[i,'rating_atmosphere'] = df.loc[i,'rating_food'].replace('bubbles', '')
    df.loc[i,'rating_value'] = df.loc[i,'rating_food'].replace('bubbles', '')
    df.loc[i,'overall_rating'] = df.loc[i,'overall_rating'].replace('bubbles', '')

**Add details to a table as empty columns**

In [22]:
df['Average prices'] = np.nan
df['Cuisine'] = np.nan
df['Meals'] = np.nan
df['Restaurant features'] = np.nan
df['Dining Style'] = np.nan
df['Good for'] = np.nan
df['Open Hours'] = np.nan
df['Location and Contact Information'] = np.nan
df['Description'] = np.nan

**Most of the restaurants do not have all features that are stored in "details", so "try...except" is needed to store info about restaurants**

In [23]:
for i in range(30):
    try:
        df.loc[i,'Average prices'] = details_list[i]['Average prices']
    except KeyError:
         continue
for i in range(30):
    try:
        df.loc[i,'Cuisine'] = details_list[i]['Cuisine']
    except KeyError:
         continue
for i in range(30):
    try:
        df.loc[i,'Meals'] = details_list[i]['Meals']
    except KeyError:
         continue
for i in range(30):
    try:
        df.loc[i,'Restaurant features'] = details_list[i]['Restaurant features']
    except KeyError:
         continue

for i in range(30):
    try:
        df.loc[i,'Dining Style'] = details_list[i]['Dining Style']
    except KeyError:
         continue
for i in range(30):
    try:
        df.loc[i,'Good for'] = details_list[i]['Good for']
    except KeyError:
         continue
for i in range(30):
    try:
        df.loc[i,'Open Hours'] = details_list[i]['Open Hours']
    except KeyError:
         continue
for i in range(30):
    try:
        df.loc[i,'Location and Contact Information'] = details_list[i]['Location and Contact Information']
    except KeyError:
         continue

for i in range(30):
    try:
        df.loc[i,'Description'] = details_list[i]['Description']
    except KeyError:
         continue


**Store reviews that are separated by ";"**

In [24]:
new_full_reviews = []
for res in reviews_full:
    new_full_reviews.append(";".join(res))

df['reviews'] = new_full_reviews

## Final Table

**First 5 columns**

In [31]:
df.iloc[:,0:5] .head()

Unnamed: 0,restaurant_name,adress,country,phone_number,review
0,STK Toronto,153 Yorkville Ave,Canada,+1 416-613-9660,350 reviews
1,ALO RESTAURANT,163 Spadina Ave,Canada,+1 416-260-2222,453 reviews
2,Scaramouche Restaurant,1 Benvenuto Pl,Canada,+1 416-961-8011,"1,388 reviews"
3,New Orleans Seafood & Steakhouse,267 Scarlett Rd,Canada,+1 416-766-7001,209 reviews
4,Richmond Station,1 Richmond St. West,Canada,+1 647-748-1444,"1,825 reviews"


**Next 5 columns**

In [26]:
df.iloc[:,5:10].head()

Unnamed: 0,overall_rating,price,cuisines,rating_food,rating_service
0,4.5 of 5,Medium range,"Steakhouse,Vegetarian Friendly,Gluten Free Opt...",4.5 of 5,4.5 of 5
1,4.5 of 5,Expensive,"French,European,Vegetarian Friendly,Vegan Opti...",5.0 of 5,5.0 of 5
2,4.5 of 5,Expensive,"French,International,Vegetarian Friendly,Vegan...",4.5 of 5,4.5 of 5
3,4.5 of 5,Medium range,"Steakhouse,Cajun & Creole,Seafood,Gluten Free ...",4.5 of 5,4.5 of 5
4,4.5 of 5,Medium range,"American,International,Vegetarian Friendly,Glu...",4.5 of 5,4.5 of 5


**Next 5 columns**

In [28]:
df.iloc[:,10:15].head()

Unnamed: 0,rating_atmosphere,rating_value,Average prices,Cuisine,Meals
0,4.5 of 5,4.5 of 5,"UAH 620 -\nUAH 3,098","Steakhouse, Contemporary, Vegetarian Friendly,...","Dinner, Drinks"
1,5.0 of 5,5.0 of 5,"UAH 1,832 -\nUAH 2,667","French, European, Vegetarian Friendly, Vegan O...","Dinner, Drinks"
2,4.5 of 5,4.5 of 5,"UAH 781 -\nUAH 1,158","French, International, Vegetarian Friendly, Ve...",Dinner
3,4.5 of 5,4.5 of 5,UAH 350 -\nUAH 781,"Steakhouse, Cajun & Creole, Seafood, Gluten Fr...",Dinner
4,4.5 of 5,4.5 of 5,,"American, International, Canadian, Vegetarian ...","Lunch, Dinner, Brunch"


**Next 5 columns**

In [29]:
df.iloc[:,15:20].head()

Unnamed: 0,Restaurant features,Dining Style,Good for,Open Hours,Location and Contact Information
0,"Reservations, Private Dining, Seating, Waitsta...",,"Romantic, Large groups, Bar scene, Special occ...",Sunday\n5:00 PM - 12:00 AM\n\n\nMonday\n3:30 P...,"Address:\n 153 Yorkville Ave, Toronto, Ontario..."
1,"Reservations, Seating, Waitstaff, Serves Alcoh...",,"Special occasions, Local cuisine, Bar scene, R...",Tuesday\n5:00 PM - 1:00 AM\n\n\nWednesday\n5:0...,"Address:\n 163 Spadina Ave | 3rd Floor, Toront..."
2,"Seating, Waitstaff, Wheelchair Accessible, Ser...",Fine Dining,"Scenic view, Business meetings, Large groups, ...",Monday\n5:30 PM - 9:30 PM\n\n\nTuesday\n5:30 P...,"Address:\n 1 Benvenuto Pl, Toronto, Ontario M4..."
3,"Takeout, Reservations, Seating, Waitstaff, Par...",,"Business meetings, Special occasions, Families...",Tuesday\n5:00 PM - 10:00 PM\n\n\nWednesday\n5:...,"Address:\n 267 Scarlett Rd | York, Toronto, On..."
4,"Waitstaff, Highchairs Available, Serves Alcoho...",,"Large groups, Romantic, Local cuisine, Special...",Monday\n11:00 AM - 10:30 PM\n\n\nTuesday\n11:0...,"Address:\n 1 Richmond St. West, Toronto, Ontar..."


**Last 2 columns**

In [30]:
df.iloc[:,20:].head()

Unnamed: 0,Description,reviews
0,STK is a unique concept that artfully blends t...,Food ambiance and service is amazing. Matthew ...
1,Hospitality [hos-pi-tal-i-tee] Origin: French;...,Walk-ins with no reservations are no problem f...
2,Scaramouche has long been celebrated by custom...,I had carbonara and it was fantastic! My husba...
3,,I had the steak and it was cooked to perfectio...
4,"Richmond Station is a stopping place, a bustli...","My wife and I had a great meal, watching the c..."
