This notebook walks through a basic example of scraping a web page. To keep everything self-contained, I include a few helper functions that change little from site to site (such as fetching a page via requests, Selenium, or an API call). The main idea is that we want functions that take pages from various sites and convert them to data.

If you've never used a notebook before, press Shift+Enter on each cell to run it. The functions and variables you define are stored in the running session, so later cells can use whatever earlier cells defined.

In [62]:
import bs4
from bs4 import BeautifulSoup
import multiprocessing
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import os
import pandas as pd
import requests
import math
import re

In [71]:
# Utilities: you shouldn't need to worry about these, just run once
def get_link_content(url, endpoint="", params=None, proxies=None):
    """Fetch url + endpoint and return the page parsed as a BeautifulSoup object."""
    result = requests.get(url + endpoint, params=params, proxies=proxies)
    soup = BeautifulSoup(result.content, "lxml")
    return soup

def get_text(content):
    """Return the cleaned text of a tag, or an empty string if the tag is None."""
    try:
        return clean_text(content.text)
    except AttributeError:
        return ""

def clean_text(text):
    """Drop non-ASCII characters and trim surrounding whitespace."""
    text_ascii = re.sub(r'[^\x00-\x7F]', '', text)
    # strip() with no argument trims spaces, tabs, and newlines
    return text_ascii.strip()
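As a quick sanity check of the text helpers, here is a minimal sketch that runs a whitespace-stripping version of `clean_text` on a few strings. The sample strings are made up for illustration, and the function is restated so the snippet runs on its own.

```python
import re

def clean_text(text):
    # Drop non-ASCII characters, then trim surrounding whitespace.
    text_ascii = re.sub(r'[^\x00-\x7F]', '', text)
    return text_ascii.strip()

# Made-up inputs resembling text pulled out of HTML tags.
samples = ["\n  4.5 ", "306 reviews\n", "Caf\u00e9 du Parc"]
cleaned = [clean_text(s) for s in samples]
print(cleaned)
```

Note that non-ASCII letters are dropped rather than transliterated, so an accented name loses the accented character entirely.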

In [11]:
# We'll start by parsing a single restaurant page on Tripadvisor.
# We will need to handle several page types from a given site.
# Eventually, we will need to do a similar task for parsing map search pages, user pages, etc.
# But for now let's start with the restaurant page for Lafayette.
TEST_URL = "https://www.tripadvisor.com/Restaurant_Review-g28970-d1318070-Reviews-Lafayette_Restaurant-Washington_DC_District_of_Columbia.html"

In [12]:
# This function will get us a nice parsable object to work with (try printing out the whole mess if you like)
soup = get_link_content(TEST_URL)

If you visit the actual page, you can browse around and find elements that you may want to turn into data features. Right-click one and choose Inspect to see the part of the HTML where it lives. For example, if I inspect the rating, I see something like:

<span class="restaurants-detail-overview-cards-RatingsOverviewCard__overallRating--nohTl">
    "4.5"
</span>

There are a few ways I can search for this:

In [86]:
# Search by class name
soup.find('span', attrs= {'class' : 'restaurants-detail-overview-cards-RatingsOverviewCard__overallRating--nohTl'})

<span class="restaurants-detail-overview-cards-RatingsOverviewCard__overallRating--nohTl">4.5<!-- --> </span>

In [87]:
# Just loop through all the 'span' objects and find ones that look like a rating
spans = soup.find_all('span')
ratings = []
for span in spans:
    try:
        # Capture anything whose text parses as a number from 0 to 5
        text = get_text(span)
        num = float(text)
        if 0 <= num <= 5:
            ratings.append(span)
    except ValueError:
        # Text that is empty or not numeric is skipped
        pass

# In this case, this gets us more than just the rating we are looking for, so we have to be careful
print(ratings)

[<span class="restaurants-detail-overview-cards-RatingsOverviewCard__overallRating--nohTl">4.5<!-- --> </span>, <span class="row_num is-shown-at-tablet">3</span>, <span class="row_num is-shown-at-tablet">4</span>, <span class="row_num ">3</span>, <span class="row_num ">4</span>, <span class="numHelp emphasizeWithColor">1  </span>, <span class="numHelp emphasizeWithColor">1  </span>, <span class="badgetext">2</span>, <span class="badgetext">2</span>, <span class="badgetext">2</span>, <span class="numHelp emphasizeWithColor">2  </span>, <span class="badgetext">2</span>, <span class="numHelp emphasizeWithColor">2  </span>, <span class="numHelp emphasizeWithColor">1  </span>, <span class="numHelp emphasizeWithColor">1  </span>]
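A third option, sketched here on the assumption that the hashed suffix of the class name (`--nohTl`) may change between site builds, is to match on a stable part of the class name with a CSS selector. The HTML below is an inline sample modeled on the spans printed above rather than the live page, and `soup_demo` is a name introduced only for this example.

```python
from bs4 import BeautifulSoup

# Inline HTML modeled on the spans printed above; not fetched from the live site.
html = """
<div>
  <span class="restaurants-detail-overview-cards-RatingsOverviewCard__overallRating--nohTl">4.5<!-- --> </span>
  <span class="row_num">3</span>
  <span class="badgetext">2</span>
</div>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# [class*="..."] matches any span whose class attribute contains the substring.
rating_span = soup_demo.select_one('span[class*="overallRating"]')
rating = float(rating_span.get_text().strip())
print(rating)
```

Because the selector targets the rating span directly, the other numeric spans (`row_num`, `badgetext`) are not picked up.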


Ultimately, what we need is a single function that takes in a soup object (assuming it's from a page similar to the test page) and outputs a dictionary containing all the elements we want from that page. This function will be plugged into a process designed to loop through lots of pages.

In [101]:
def search_class(soup, class_name):
    return get_text(soup.find('span', attrs = {'class' : class_name}))

# Takes a restaurant page and converts to a dictionary of items we care about
def soup_to_dict_restaurant(restaurant_page_soup):
    
    # Overall Restaurant Rating
    overall_rating = float(search_class(restaurant_page_soup,'restaurants-detail-overview-cards-RatingsOverviewCard__overallRating--nohTl'))
    
    # Address of the restaurant
    street_address = search_class(restaurant_page_soup,'street-address')
    
    # Number of reviews
    review_count_text = search_class(restaurant_page_soup,'reviewCount')
    # Keep only the digits so "306 reviews", "1 review", and "1,306 reviews" all parse
    review_count = float(re.sub(r'\D', '', review_count_text))
    
    # Add in any more items you want to save. From this page, probably:
    # - Restaurant Tags
    # -
    
    # Build into dictionary
    dict_ = {
        'address' : street_address,
        'rating' : overall_rating,
        'num_reviews' : review_count
           }
    return dict_

In [103]:
# Try it out with our test page
soup_to_dict_restaurant(soup)

{'address': '800 16th St NW', 'rating': 4.5, 'num_reviews': 306.0}
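If an element is missing from a page, `get_text` returns an empty string and `float("")` raises `ValueError`, so the function above fails on pages that lack a rating or review count. A minimal sketch of a more forgiving conversion follows; `parse_number` is a hypothetical helper, not part of the original notebook.

```python
import re

def parse_number(text, default=None):
    """Return the first number found in text as a float, or default if none."""
    # Remove thousands separators so "1,306 reviews" parses as 1306.
    match = re.search(r'\d+(?:\.\d+)?', text.replace(',', ''))
    return float(match.group()) if match else default

print(parse_number("306 reviews"))
print(parse_number("1,306 reviews"))
print(parse_number(""))
```

Returning `None` for a missing value keeps the loop over many pages running and leaves it to later analysis to decide how to treat the gap.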

We can repeat this process with a user page (https://www.tripadvisor.com/Profile/jujubean79), a search page (https://www.tripadvisor.com/Restaurants-g28970-Washington_DC_District_of_Columbia.html, from which we will need the links to the restaurant pages), or other page types on Tripadvisor. It would be really helpful to have good documentation for each site on:
* What types of pages that site has (user, restaurant, individual review, search page, etc.)
* How we will need to loop through those pages (do we start with a search of restaurants, then loop through links to restaurants, then links to users?)
* List of information grabbed from each page type
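One way to keep this documentation next to the code, offered as a sketch rather than a prescribed format, is a small record per page type. The `PageType` class, its field names, and the `TRIPADVISOR_PAGES` list are illustrative assumptions. The restaurant fields are the ones extracted earlier in this notebook, and entries not yet worked out are left empty on purpose.

```python
from dataclasses import dataclass, field

@dataclass
class PageType:
    name: str                                     # e.g. "restaurant", "search", "user"
    example_url: str                              # a page to test the parser against
    fields: list = field(default_factory=list)    # data grabbed from this page type
    links_to: list = field(default_factory=list)  # page types reachable from here

TRIPADVISOR_PAGES = [
    PageType(
        name="search",
        example_url="https://www.tripadvisor.com/Restaurants-g28970-Washington_DC_District_of_Columbia.html",
        links_to=["restaurant"],
    ),
    PageType(
        name="restaurant",
        example_url="https://www.tripadvisor.com/Restaurant_Review-g28970-d1318070-Reviews-Lafayette_Restaurant-Washington_DC_District_of_Columbia.html",
        fields=["address", "rating", "num_reviews"],
    ),
    PageType(
        name="user",
        example_url="https://www.tripadvisor.com/Profile/jujubean79",
    ),
]

for page in TRIPADVISOR_PAGES:
    print(page.name, page.fields, page.links_to)
```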

Gathering this info for each site will take time, but ideally we should be able to write one function per page type that converts that page to a dictionary, and plug it into a more generalizable process.
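A minimal sketch of what such a generalizable process might look like, assuming it only needs a fetch function and a parse function per page type. `scrape_pages`, `fake_fetch`, and `fake_parse` are hypothetical names; the fakes stand in for `get_link_content` and `soup_to_dict_restaurant` so the example runs without network access.

```python
def scrape_pages(urls, fetch, parse):
    """Fetch each URL, convert it with parse, and collect the resulting dicts."""
    rows, failures = [], []
    for url in urls:
        try:
            row = parse(fetch(url))
        except Exception as exc:  # one bad page should not stop the whole run
            failures.append((url, repr(exc)))
            continue
        row["url"] = url
        rows.append(row)
    return rows, failures

# Stand-ins so the sketch runs offline; in the notebook these would be
# get_link_content and soup_to_dict_restaurant.
def fake_fetch(url):
    return {"rating": "4.5"} if "good" in url else {}

def fake_parse(page):
    return {"rating": float(page["rating"])}

rows, failures = scrape_pages(["site/good-1", "site/bad-2"], fake_fetch, fake_parse)
print(rows)
print(failures)
```

Passing the fetcher and parser in as arguments keeps the loop identical across sites and page types. The list of dictionaries can then be handed to `pd.DataFrame(rows)` for analysis.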