## Web Scraping

To attach zipcode-level data to the businesses in my Yelp business dataset, I need to scrape it. We will scrape characteristic information about each zipcode from https://www.zipdatamaps.com.

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import re

from selenium import webdriver
import os


Set up chromedriver so that Selenium can drive Chrome.

In [2]:
chromedriver = "chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver


The function below extracts Current Population, Median Household Income, Average Commute Time, and Unemployment Rate for the zipcode passed in as input. It returns a dataframe containing these figures for that zipcode.


In [4]:
def get_data_from_zipcode(zipcode):
    """Scrape profile figures for one zipcode and return them as a one-row dataframe."""
    url = 'https://www.zipdatamaps.com/' + zipcode
    driver = webdriver.Chrome(chromedriver)
    try:
        driver.set_page_load_timeout(20)
        driver.set_script_timeout(20)
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
    finally:
        # Close the browser even if the page fails to load
        driver.quit()

    # Search within the container two levels above the 'Profile Data' heading
    header = soup.find('h2', string=re.compile('Profile Data'))
    profile = header.parent.parent

    def cell_text(label):
        # From the label cell, step over its own text node to reach the value cell
        label_cell = profile.find('td', string=re.compile(label))
        return label_cell.next_element.next_element.text

    pop_text = cell_text('Current Population')
    median_income_text = cell_text('Median Household Income')
    commute_text = cell_text('Average Commute Time')
    unemployment_rate_text = cell_text('Unemployment Rate')

    figures = [zipcode, pop_text, median_income_text, commute_text, unemployment_rate_text]
    col_names = ['Zip Code', 'Population', 'Median Household Income', 'Avg Commute Time', 'Unemployment Rate']
    temp = pd.DataFrame([figures], columns=col_names)
    print('\nScraped Data Complete: ', zipcode)
    return temp
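
The label-to-value lookup used in the function can be tried without launching a browser. The sketch below is only an illustration: the HTML fragment is made up (the real page layout on zipdatamaps.com may differ), and `extract_figures` is a hypothetical helper. It uses BeautifulSoup's `find_next_sibling('td')` rather than chained `next_element` calls, which also works when there is whitespace between the cells.

```python
import re
from bs4 import BeautifulSoup

# Invented fragment for illustration only; the live page may be laid out differently.
SAMPLE_HTML = """
<table>
  <tr><td>Current Population</td> <td>26941</td></tr>
  <tr><td>Median Household Income</td> <td>$45733</td></tr>
  <tr><td>Average Commute Time</td> <td>18.9 Minutes</td></tr>
  <tr><td>Unemployment Rate</td> <td>4.9%</td></tr>
</table>
"""

LABELS = ['Current Population', 'Median Household Income',
          'Average Commute Time', 'Unemployment Rate']


def extract_figures(html, labels=LABELS):
    """Return {label: value text} for each label (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    figures = {}
    for label in labels:
        # Find the cell holding the label, then the next <td> in the same row
        label_cell = soup.find('td', string=re.compile(label))
        if label_cell is None:
            figures[label] = None
            continue
        value_cell = label_cell.find_next_sibling('td')
        figures[label] = value_cell.get_text(strip=True) if value_cell else None
    return figures


figures = extract_figures(SAMPLE_HTML)
print(figures)
```

Returning `None` for a missing label, instead of raising, makes it easier to spot zipcodes whose page lacks one of the figures.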


Next, we load the zipcode list from the business dataframe. We will scrape the unique zipcodes from this dataframe.

In [5]:
business_json_path = 'yelp_academic_dataset_business.json'

df_b = pd.read_json(business_json_path, lines = True)


# Keep only postal codes without letters, then take the unique values
postals = df_b['postal_code'][~df_b['postal_code'].str.contains('[A-Za-z]')]
postals_list = postals.unique()

postals_list

array(['80302', '97218', '97214', '32763', '30316', '32804', '43206',
       '78752', '78748', '01960', '32806', '43215', '30083', '32830',
       '34746', '02148', '32809', '97222', '32836', '97210', '30305',
       '02215', '78704', '30309', '02128', '02494', '97229', '02130',
       '02115', '97230', '78735', '78741', '78759', '34711', '98660',
       '32730', '02446', '30340', '97204', '32811', '80301', '32771',
       '01803', '02134', '78701', '34786', '02169', '30345', '78705',
       '78729', '30326', '32818', '02150', '02145', '97213', '30339',
       '98665', '30318', '98685', '01907', '78745', '02465', '01887',
       '32789', '78734', '02472', '30303', '32819', '02116', '97211',
       '32725', '97215', '02136', '78610', '02139', '97217', '01867',
       '97209', '78749', '02140', '43214', '30084', '30030', '02481',
       '32703', '32779', '32780', '43219', '43223', '78731', '43068',
       '78757', '97212', '43220', '78758', '02144', '02131', '98684',
       '98662', '024

For the purposes of this demonstration, I will scrape just the first 10 from this list.

In [6]:
postals_list_10 = postals_list[0:10]

postals_list_10

array(['80302', '97218', '97214', '32763', '30316', '32804', '43206',
       '78752', '78748', '01960'], dtype=object)

Let's run the function on this shortened list, looping through it and calling the function once for each zipcode.

In [7]:
# Collect one single-row dataframe per zipcode, then combine them once at the end
frames = []
for i in postals_list_10:
    frames.append(get_data_from_zipcode(i))
scraped_data = pd.concat(frames)



Scraped Data Complete:  80302

Scraped Data Complete:  97218

Scraped Data Complete:  97214

Scraped Data Complete:  32763

Scraped Data Complete:  30316

Scraped Data Complete:  32804

Scraped Data Complete:  43206

Scraped Data Complete:  78752

Scraped Data Complete:  78748

Scraped Data Complete:  01960
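
When scraping the full list rather than ten zipcodes, a single timeout or missing table would stop the loop above. One possible pattern, shown as a sketch and not as this notebook's method, records failures and pauses between requests. `scrape_all`, its `fetch` argument, and `delay` are hypothetical names; `fetch` stands in for `get_data_from_zipcode`, and a stub replaces it here so the block runs without a browser.

```python
import time


def scrape_all(zipcodes, fetch, delay=1.0):
    """Call fetch(zipcode) for each zipcode; return (results, failures)."""
    results, failures = [], []
    for i, zipcode in enumerate(zipcodes):
        if i > 0:
            time.sleep(delay)  # pause between requests to go easy on the site
        try:
            results.append(fetch(zipcode))
        except Exception as exc:  # e.g. a timeout or a missing table
            failures.append((zipcode, repr(exc)))
    return results, failures


def fake_fetch(zipcode):
    # Stub standing in for get_data_from_zipcode; fails on one made-up code
    if zipcode == '00000':
        raise ValueError('no profile table')
    return {'Zip Code': zipcode}


results, failures = scrape_all(['80302', '00000', '97218'], fake_fetch, delay=0)
print(results)
print(failures)
```

With the real function passed as `fetch`, the collected frames could be combined with `pd.concat(results)`, and the zipcodes in `failures` retried later.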


In [8]:
scraped_data

   Zip Code  Population  Median Household Income  Avg Commute Time  Unemployment Rate
0     80302       26941                   $45733      18.9 Minutes               4.9%
0     97218       14561                   $50070      24.4 Minutes               5.0%
0     97214       23813                   $48887      21.7 Minutes               5.0%
0     32763       21263                   $38372        26 Minutes               5.8%
0     30316       31110                   $43026      28.2 Minutes               4.2%
0     32804       17312                   $56405      18.8 Minutes               5.4%
0     43206       21864                   $41630      19.9 Minutes               6.0%
0     78752       18064                   $36697      24.1 Minutes               3.9%
0     78748       40651                   $74369      27.5 Minutes               3.9%
0     01960       50944                   $71022      25.2 Minutes               5.9%


The zipcode data scrape is complete. We can move on to data cleaning/sorting as our next step.
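
As a starting point for that cleaning step, the scraped strings (such as '$45733', '18.9 Minutes', and '4.9%') need converting to numbers. The following is a minimal sketch, assuming every value follows the formats seen in the ten rows above; `to_number` is a hypothetical helper.

```python
import re


def to_number(text):
    """Pull the first number out of a string such as '$45,733' or '18.9 Minutes'."""
    match = re.search(r'-?\d+(?:\.\d+)?', text.replace(',', ''))
    return float(match.group()) if match else None


# One scraped row, copied from the results table
row = {'Population': '26941', 'Median Household Income': '$45733',
       'Avg Commute Time': '18.9 Minutes', 'Unemployment Rate': '4.9%'}
cleaned = {name: to_number(value) for name, value in row.items()}
print(cleaned)
```

The Zip Code column is probably best left as a string, so that codes such as '01960' keep their leading zero.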