# Web Scraping

Web scraping is the automated extraction of data from websites.

The process begins by inspecting the website's HTML and identifying the elements that contain the data we are interested in. The data can sit in different types of HTML elements, such as `div` or `span`, and how it is organized determines how we approach it:

- Grouped data (contact details, product details, etc.) is stored in similar HTML elements. We loop over these elements and extract the desired information from each one.
- Ungrouped data may be stored in an HTML element type that is shared with other pieces of data. If the desired data has no unique class or id, we can store all instances of this element in a list or data frame and use indexing to access the desired information.
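
The two cases above can be sketched on a small, made-up HTML snippet. The markup, class names, and values below are assumptions chosen for illustration and do not come from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only to illustrate the two cases
html = """
<div class="product"><span class="name">Kettle</span><span class="price">$25</span></div>
<div class="product"><span class="name">Toaster</span><span class="price">$40</span></div>
<p>Free shipping over $50</p>
<p>Open 9am to 5pm</p>
<p>Call 555-0100</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Grouped data: every product lives in a similar <div>, so we loop over them
products = []
for item in soup.find_all("div", class_="product"):
    products.append({
        "name": item.find("span", class_="name").text.strip(),
        "price": item.find("span", class_="price").text.strip(),
    })

# Ungrouped data: the <p> tags have no unique class or id,
# so we collect all of them and pick the one we want by position
paragraphs = [p.text.strip() for p in soup.find_all("p")]
opening_hours = paragraphs[1]

print(products)
print(opening_hours)
```

Indexing by position is fragile: if the page adds or removes a paragraph, the index has to be updated.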

#### Below are examples of web scraping on different types of data.

In [26]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

## Search Result Data

### Indeed Job Postings

In [4]:
#Indeed search for Business Analyst roles in Toronto
page = requests.get("https://ca.indeed.com/jobs?q=Business+Analyst&l=Toronto%2C+ON")

#Checking to see what response code we get from the page we requested
#The HTTP 200 OK success status response code indicates that the request has succeeded
page

<Response [200]>

In [5]:
#We can use the BeautifulSoup library to parse this document
soup = BeautifulSoup(page.content, 'html.parser')

In [6]:
#We can use the find_all method to search for items by class or by id
jobs_html = soup.find_all('td', class_="resultContent")

#Creating an empty dataframe with one column for each value we are interested in
jobs_df = pd.DataFrame(columns=['Job Title', 'Company Name', 'Location'])

for jobs in jobs_html:
    #Some jobs have a label called "new" to indicate a new posting. We need to check for this to identify which span element contains the job title
    if (jobs.find_all('span')[0]).text.strip() == "new":
        job_title_html = jobs.find_all('span')[1]
    else:
        job_title_html = jobs.find_all('span')[0]
    
    company_name_html = jobs.find('span', class_="companyName")
    location_html = jobs.find('div', class_="companyLocation")
   
    job_title = job_title_html.text.strip()
    company_name = company_name_html.text.strip()
    location = location_html.text.strip()
    
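    #Collecting the values for this posting in a single row
    row = {'Job Title':job_title, 'Company Name':company_name, 'Location':location}

    #DataFrame.append was removed in pandas 2.0, so we add the row with pd.concat
    jobs_df = pd.concat([jobs_df, pd.DataFrame([row])], ignore_index=True)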

jobs_df

,Job Title,Company Name,Location
0,Category Business Analyst,Canadian Tire,"Toronto, ON"
1,business management analyst,MSG GLOBAL SOLUTIONS CANADA INC,"Toronto, ON"
2,"Business Analyst, Commercial",Canada Goose Inc.,"Toronto, ON"
3,Business Analyst,ThreePDS Inc,"Toronto, ON"
4,"Intern, Business Analyst",Equitable Bank,"Toronto, ON"
5,Junior Business Analyst (6 month contract),H&M,"Toronto, ON"
6,Business Analyst / QA,MEDCAN,"Toronto, ON"
7,Business Support Analyst,BMO Financial Group,"Toronto, ON"
8,Jr. / Int. Business Analyst,JLL,"Toronto, ON"
9,Business Analyst,Distributel,"Toronto, ON"
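
The check for the "new" label in the loop above can be written as a small helper that returns the first span whose text is not the label. This is a minimal sketch on made-up markup: the helper name `first_span_text_excluding` and the HTML are assumptions for illustration, not Indeed's actual page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating two result cards, one with a "new" label
html = """
<table><tr>
<td class="resultContent"><span>new</span><span>Data Analyst</span></td>
<td class="resultContent"><span>Business Analyst</span></td>
</tr></table>
"""

def first_span_text_excluding(card, skip_label="new"):
    """Return the text of the first <span> that is not empty and not the label."""
    for span in card.find_all("span"):
        text = span.text.strip()
        if text and text != skip_label:
            return text
    return None  # no usable span found in this card

soup = BeautifulSoup(html, "html.parser")
titles = [
    first_span_text_excluding(card)
    for card in soup.find_all("td", class_="resultContent")
]
print(titles)
```

Unlike indexing with `[0]` and `[1]`, this version does not raise an `IndexError` when a card contains no span at all.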


### Yelp Listings

In [75]:
#Yelp search for Coffee Shops in Los Angeles
page = requests.get("https://www.yelp.com/search?find_desc=Coffee+Shop&find_loc=Los+Angeles%2C+CA")

#Checking to see what response code we get from the page we requested
#The HTTP 200 OK success status response code indicates that the request has succeeded
page

<Response [200]>

In [76]:
#We can use the BeautifulSoup library to parse this document
soup = BeautifulSoup(page.content, 'html.parser')

In [77]:
#We can use the find_all method to search for items by class or by id
shops_html = soup.find_all('div', class_="arrange-unit__09f24__eFC_S arrange-unit-fill__09f24__1bMmp border-color--default__09f24__3Epto")

#Creating an empty dataframe with one column for each value we are interested in
shops_df = pd.DataFrame(columns=['Coffee Shop Name', 'Number of Reviews', 'Location'])

for shop in shops_html:
    #Extracting the desired html elements
    shop_name_html = shop.find('a', class_="css-og60gk")
    number_of_reviews_html = shop.find('span', class_="reviewCount__09f24__3GsGY css-e81eai")
    location_html = (shop.find('p', class_="css-1j7sdmt"))
    
    #Some of the matched elements do not contain all three fields, in which case .find() returns None.
    #The if statement below skips any element where a field is missing
    if None not in (shop_name_html, number_of_reviews_html, location_html):
       
        
        #Extracting the text from the html elements
        shop_name = shop_name_html.text.strip()
        number_of_reviews = number_of_reviews_html.text.strip()
        
        #We need to use .find() a second time to access the <span> tag within the <p> tag
        location = location_html.find('span', class_="css-e81eai").text.strip()
        
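        #Collecting the values for this shop in a single row
        row = {'Coffee Shop Name':shop_name, 'Number of Reviews':number_of_reviews, 'Location':location}

        #DataFrame.append was removed in pandas 2.0, so we add the row with pd.concat
        shops_df = pd.concat([shops_df, pd.DataFrame([row])], ignore_index=True)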

shops_df

,Coffee Shop Name,Number of Reviews,Location
0,Alibi Coffee,157,Larchmont
1,Coffee Connection,626,Mar Vista
2,Alchemist Coffee Project,1164,Wilshire Center
3,Bungalow 40,87,Hollywood
4,Coffee MCO,364,Pico-Union
5,Coffee For Sasquatch,311,Hancock Park
6,Document Coffee Bar,592,Koreatown
7,Stereoscope Coffee Company,48,Echo Park
8,Boxx Coffee Roasters,42,Arts District
9,Neighborhood,88,Fairfax
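
The loop above skips a listing entirely when any field is missing. An alternative, sketched below, keeps the listing and records `None` for the missing field. The helper `text_or_default` and the sample markup are hypothetical and only illustrate the pattern; the class names are not Yelp's.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second shop has no review count
html = """
<div class="shop"><a class="name">Cafe A</a><span class="reviews">12</span></div>
<div class="shop"><a class="name">Cafe B</a></div>
"""

def text_or_default(parent, tag, class_name, default=None):
    """Return the stripped text of the first match, or default if .find() returns None."""
    element = parent.find(tag, class_=class_name)
    return element.text.strip() if element is not None else default

soup = BeautifulSoup(html, "html.parser")
rows = [
    {
        "name": text_or_default(shop, "a", "name"),
        "reviews": text_or_default(shop, "span", "reviews"),
    }
    for shop in soup.find_all("div", class_="shop")
]
print(rows)
```

Which approach fits depends on the analysis: skipping incomplete rows gives a cleaner table, while keeping them shows how often a field is missing.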


## Tabular Data

### IMDb Box Office Charts 

In [78]:
#IMDb page with today's box office charts
page = requests.get("https://www.imdb.com/chart/boxoffice/")

#Checking to see what response code we get from the page we requested
#The HTTP 200 OK success status response code indicates that the request has succeeded
page

<Response [200]>

In [79]:
#We can use the BeautifulSoup library to parse this document
soup = BeautifulSoup(page.content, 'html.parser')

In [80]:
#We can use the find_all method to search for items by class or by id
imdb_html = soup.find_all('tr')

#Creating an empty dataframe with one column for each value we are interested in
imdb_df = pd.DataFrame(columns=['Title', 'Weekend Earnings', 'Gross Earnings', 'Weeks in Theaters'])

for movie in imdb_html:
    #Extracting the desired html elements
    title_html = movie.find('td', class_="titleColumn")
    weekend_html = movie.find('td', class_="ratingColumn")
    gross_html = movie.find('span', class_="secondaryInfo")
    weeks_html = movie.find('td', class_="weeksColumn")
    
    #Some rows (such as the header row) do not contain these cells, in which case .find() returns None.
    #The if statement below skips any row where a cell is missing
    if None not in (title_html, weekend_html, gross_html, weeks_html):
        
        #Extracting the text from the html elements
        title = title_html.text.strip()
        weekend = weekend_html.text.strip()
        gross = gross_html.text.strip()
        weeks = weeks_html.text.strip()

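        #Collecting the values for this movie in a single row
        row = {'Title':title, 'Weekend Earnings':weekend, 'Gross Earnings':gross, 'Weeks in Theaters':weeks}

        #DataFrame.append was removed in pandas 2.0, so we add the row with pd.concat
        imdb_df = pd.concat([imdb_df, pd.DataFrame([row])], ignore_index=True)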

imdb_df

,Title,Weekend Earnings,Gross Earnings,Weeks in Theaters
0,Venom: Let There Be Carnage,$90.0M,$90.0M,1
1,The Addams Family 2,$17.3M,$17.3M,1
2,Shang-Chi and the Legend of the Ten Rings,$6.1M,$206.2M,5
3,The Many Saints of Newark,$4.7M,$4.7M,1
4,Dear Evan Hansen,$2.5M,$11.8M,2
5,Free Guy,$2.3M,$117.6M,8
6,Candyman,$1.3M,$58.9M,6
7,Jungle Cruise,$703K,$116.1M,10
8,Chal Mera Putt 3,$644K,$644K,1
9,The Jesus Music,$549K,$549K,1
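
The earnings in the table above are scraped as text such as `$90.0M` or `$703K`, so they cannot be sorted or summed as numbers yet. A minimal sketch of a converter follows; the function name `parse_dollars` and the set of suffixes it handles are assumptions based only on the values shown above.

```python
def parse_dollars(text):
    """Convert strings like '$90.0M' or '$703K' to a float number of dollars."""
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    cleaned = text.strip().lstrip("$").replace(",", "")
    if not cleaned:
        raise ValueError("empty amount")
    suffix = cleaned[-1].upper()
    if suffix in multipliers:
        # Scale the numeric part by the suffix multiplier
        return float(cleaned[:-1]) * multipliers[suffix]
    # No suffix: the value is already a plain number
    return float(cleaned)

print(parse_dollars("$90.0M"))
print(parse_dollars("$703K"))
```

With a data frame like `imdb_df`, the same function could be applied to a whole column, for example with `imdb_df['Weekend Earnings'].apply(parse_dollars)`.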


## Text Data

### Wikipedia Page

In [11]:
#Wikipedia page on Television
page = requests.get("https://en.wikipedia.org/wiki/Television")

#Checking to see what response code we get from the page we requested
#The HTTP 200 OK success status response code indicates that the request has succeeded
page

<Response [200]>

In [12]:
#We can use the BeautifulSoup library to parse this document
soup = BeautifulSoup(page.content, 'html.parser')

In [21]:
#We can use the find_all method to search for items by class or by id
tvs_html = soup.find_all('p')

#Creating an empty list to store all values we are interested in
tv_wiki = []

for tv in tvs_html:
    tv_wiki.append(tv.text.strip())    

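#Removing all empty strings (list.remove('') would drop only the first one and raise an error if there were none)
tv_wiki = [paragraph for paragraph in tv_wiki if paragraph]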

#See the first three paragraphs
tv_wiki[0:3]

['Television, sometimes shortened to TV or telly, is a telecommunication medium used for transmitting moving images in monochrome (black and white), or in color, and in two or three dimensions and sound. The term can refer to a television set, a television show, or the medium of television transmission. Television is a mass medium for advertising, entertainment, news, and sports.',
 'Television became available in crude experimental forms in the late 1920s, but it would still be several years before the new technology would be marketed to consumers. After World War II, an improved form of black-and-white television broadcasting became popular in the United Kingdom and United States, and television sets became commonplace in homes, businesses, and institutions. During the 1950s, television was the primary medium for influencing public opinion.[1] In the mid-1960s, color broadcasting was introduced in the U.S. and most other developed countries. The availability of various types of archi