# Web Scraping with BeautifulSoup

Sometimes there is no readily available data set for your project, but the data you need exists on the web and can be scraped. One way to do this in Python is with `BeautifulSoup`.

## What we will accomplish in this notebook

In this notebook we will:
- Discuss the structure of HTML code,
- Introduce the `bs4` package,
- Parse simple HTML code with `BeautifulSoup`,
- Review how to request the HTML code from a url,
- Scrape data from an actual webpage and
- Touch on some of the issues that may arise when web scraping.

In [1]:
## Import base packages we'll use
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from seaborn import set_style
set_style("whitegrid")

## Scraping data with `BeautifulSoup`

### Importing `BeautifulSoup`

In order to use `BeautifulSoup` we first need to make sure that we have it installed on our computer. Try to run the following code chunks.

In [2]:
## this imports bs4, the package that contains BeautifulSoup
import bs4

In [3]:
## Run this to check your version
## I wrote this notebook with version  4.12.2
print(bs4.__version__)

4.12.2


If the above code does not work you will need to install the package before being able to run the code in this notebook. Here are installation instructions from the `bs4` documentation:
- Via conda: <a href="https://anaconda.org/conda-forge/bs4">https://anaconda.org/conda-forge/bs4</a>,
- Via pip: <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup">https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup</a>.

### The structure of an HTML page

`BeautifulSoup` takes in an HTML document and will 'parse' it for you so that you can extract the information you want. To best understand what that means we will look at a toy example of a webpage. To see what the snippet of HTML code below looks like in a web browser click here <a href="SampleHTML.html">SampleHTML.html</a>.

In [4]:
# This is an html chunk
# It has a head and a body, just like you
# This example comes from the BeautifulSoup official documentation here:  https://www.crummy.com/software/BeautifulSoup/bs4/doc/

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""

We can now use `BeautifulSoup` to parse this simple HTML chunk.

In [5]:
## First we import the BeautifulSoup object
from bs4 import BeautifulSoup

In [6]:
## Now we make a BeautifulSoup object out of the html code
## The first input is the html code
## The second input is how you want BeautifulSoup
## to parse the code
soup = BeautifulSoup(html_doc, 'html.parser')

In [7]:
## Let's use the prettify method to make our html pretty and see what it has to say
## Ideally this is how someone writing pure html code would write their code
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



HTML documents have a natural tree structure that we will briefly cover now. Here is the tree of our sample HTML:

<img src = "lecture_2_assets/html_tree.png" width = "50%"></img>

Each level in the tree represents a 'generation' of the HTML code. The body has three "p" children, and the leftmost "p" has one "b" child. `BeautifulSoup` helps us traverse these trees to gather the data we want.
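The parent/child relationships described above can be checked directly. The following is a minimal sketch that uses a trimmed-down copy of the sample document (the shortened story text is an assumption made for brevity) and lists the direct children of a tag with `find_all(True, recursive=False)`.

```python
from bs4 import BeautifulSoup

# A shortened stand-in for the sample document (assumption: the story
# text is trimmed so that the example stays small)
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time...</p>
<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Passing True matches every tag; recursive=False restricts the search
# to direct children, i.e. the next generation of the tree
body_children = [tag.name for tag in soup.body.find_all(True, recursive=False)]
first_p_children = [tag.name for tag in soup.p.find_all(True, recursive=False)]

print(body_children)       # names of the tags directly under <body>
print(first_p_children)    # names of the tags directly under the first <p>
print(soup.b.parent.name)  # moving one generation up from <b>
```

Here `body_children` holds three `"p"` entries and `first_p_children` holds a single `"b"`, matching the tree in the figure.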

In [8]:
## Below are some examples of beautifulsoup methods and 
## attributes that help us better understand the structure 
## of html code

In [9]:
## We can traverse to the "title" by working our way through
## the tree
soup.head.title

<title>The Dormouse's story</title>

In [10]:
## Notice we can also get the title like so
## This is because this is the first and only title 
## in the code
soup.title

<title>The Dormouse's story</title>

In [11]:
## What if I just want the text from the title?
soup.title.text

"The Dormouse's story"

In [12]:
## What html structure is the title's parent?
soup.title.parent

<head><title>The Dormouse's story</title></head>

In [13]:
## What is the first a of the html document?
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [14]:
## What is the first a's class?
soup.a['class']

['sister']

In [15]:
## There are multiple a's: can I find all of them?

for a in soup.find_all('a'):
    print()
    print(a['class'], a.text)


['sister'] Elsie

['sister'] Lacie

['sister'] Tillie


In [16]:
## Find the first p of the document
## What is the first p's class? 
## What string is in that p?
print(soup.p)

print(soup.p['class'])

print(soup.p.text)

<p class="title"><b>The Dormouse's story</b></p>
['title']
The Dormouse's story


In [17]:
## For all of the a's in the document find their href

for a in soup.find_all('a'):
    print(a['href'])


http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
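The searches above can be narrowed with filters. The sketch below shows `find_all` with the `class_` keyword, `find` with an `id`, and `Tag.get` for attribute access. The markup is a modified copy of the sample: as an assumption for illustration, the last link has had its `href` removed so that the difference between `a["href"]` and `a.get("href")` is visible.

```python
from bs4 import BeautifulSoup

# Modified sample markup (assumption: the third link has no href)
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class
sisters = soup.find_all("a", class_="sister")

# find returns only the first match, or None when nothing matches
lacie = soup.find("a", id="link2")

# .get returns None for a missing attribute, whereas a["href"]
# would raise a KeyError for the third link
hrefs = [a.get("href") for a in sisters]

print(len(sisters))
print(lacie.text)
print(hrefs)
```

Using `.get` is a reasonable default whenever an attribute is not guaranteed to be present on every matching tag.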


## Scraping real webpages

Let's now pivot to a real webpage. In this example we will scrape information from the Erdős Institute Invitations to Industry (I2I) page:

https://www.erdosinstitute.org/invitations-to-industry

### Sending a request

In order to scrape that data we need the HTML code associated with the page. In Python we can get it with the `requests` package.

In [18]:
import requests

In [19]:
response = requests.get(url="https://www.erdosinstitute.org/invitations-to-industry")

First we check whether the request was successful. If it was, the status code shown below should be `200`, which tells us that the request was received and the data was returned successfully. If we instead saw something like `404` or `500`, we would know that something went wrong. For a list of possible response codes see <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses">https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses</a>.

In [20]:
response.status_code

200
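As a small aid for reading status codes, the sketch below uses the standard library's `http.HTTPStatus` to turn a numeric code into its standard phrase and a rough category. The helper name `describe_status` is hypothetical and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description of an HTTP status code.

    Hypothetical helper for illustration; the categories follow the
    standard ranges (2xx success, 4xx client error, 5xx server error).
    """
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        kind = "success"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"


for code in (200, 404, 500):
    print(describe_status(code))
```

In a script, a common alternative to inspecting the code by hand is `response.raise_for_status()`, which raises an exception when the response carries a 4xx or 5xx status.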

In [21]:
## The HTML code is stored in response.text
print(response.text)

<!DOCTYPE html>
<html lang="en">
<head>
  
  <meta charset='utf-8'>
  <meta name="viewport" content="width=device-width, initial-scale=1" id="wixDesktopViewport" />
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="generator" content="Wix.com Website Builder"/>

  <link rel="icon" sizes="192x192" href="https://static.wixstatic.com/media/55f531_1a160ca9baef49189bf7ceeab4a619d6%7Emv2.png/v1/fill/w_192%2Ch_192%2Clg_1%2Cusm_0.66_1.00_0.01/55f531_1a160ca9baef49189bf7ceeab4a619d6%7Emv2.png" type="image/png"/>
  <link rel="shortcut icon" href="https://static.wixstatic.com/media/55f531_1a160ca9baef49189bf7ceeab4a619d6%7Emv2.png/v1/fill/w_32%2Ch_32%2Clg_1%2Cusm_0.66_1.00_0.01/55f531_1a160ca9baef49189bf7ceeab4a619d6%7Emv2.png" type="image/png"/>
  <link rel="apple-touch-icon" href="https://static.wixstatic.com/media/55f531_1a160ca9baef49189bf7ceeab4a619d6%7Emv2.png/v1/fill/w_180%2Ch_180%2Clg_1%2Cusm_0.66_1.00_0.01/55f531_1a160ca9baef49189bf7ceeab4a619d6%7Emv2.png" type="image

In [22]:
## We can now parse this data with BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

In [23]:
soup.head.title

<title>The Erdős Institute | Invitations to Industry | Career Exploration Seminars | Online</title>

### Web developer tools

As we can see, this is much messier than our simple example above. 

We want to find the presenter name and the company link associated with each of the logos at the bottom of the page.

To home in on this information we can use your browser's web developer tools. These are generally found in your browser's dropdown menus. For example, in Chrome you can open them via View > Developer > Developer Tools.

The web developer tools will allow you to find out where various components of the webpage live in the code. For example, you should be able to hover over an item on the webpage and it will highlight what HTML structure holds it.

We can use this information to get the data we desire.

Looking at one image and moving up the tree, we can see that all of the images are contained in the following `fluid-columns-repeater` element:

```html
<fluid-columns-repeater horizontal-gap="10" vertical-gap="10" justify-content="center" direction="ltr" container-id="comp-lr5r5app_wrapper" items="54" class="GPmm8Z" role="list" style>
```
Is this the only such element? Let's check.

In [24]:
# Find all fluid-columns-repeater elements with this class and container-id.

soup.find_all("fluid-columns-repeater",attrs={"class":"GPmm8Z", "container-id":"comp-lr5r5app_wrapper"})

[<fluid-columns-repeater class="GPmm8Z" container-id="comp-lr5r5app_wrapper" direction="ltr" horizontal-gap="10" items="54" justify-content="center" role="list" style="visibility:hidden" vertical-gap="10"><div class="comp-lr5r5ax5 YzqVVZ wixui-repeater__item" id="comp-lr5r5ax5__d441214c-80b5-4166-9c1e-b904a6d19089"><div class="MW5IWV" data-hook="bgLayers" id="bgLayers_comp-lr5r5ax5"><div class="LWbAav Kv1aVt" data-testid="colorUnderlay"></div><div class="VgO9Yg" id="bgMedia_comp-lr5r5ax5"></div></div><div class="" data-mesh-id="comp-lr5r5ax5__d441214c-80b5-4166-9c1e-b904a6d19089inlineContent" data-testid="inline-content"><div data-mesh-id="comp-lr5r5ax5__d441214c-80b5-4166-9c1e-b904a6d19089inlineContent-gridContainer" data-testid="mesh-container-content"><div class="MazNVa comp-lr5r73fn wixui-image rYiAuL" id="comp-lr5r73fn__d441214c-80b5-4166-9c1e-b904a6d19089" title="BOA ML.jpg"><a class="j7pOnl" data-testid="linkElement" href="https://business.bofa.com/content/boaml/en_us/home.html"

In [25]:

past_participants_container = soup.find_all("fluid-columns-repeater",attrs={"class":"GPmm8Z", "container-id":"comp-lr5r5app_wrapper"})[0]

In [26]:
# Check it for all of the "img" tags.

past_participants_container.find_all("img")

[<img alt="Shahnawaz Khalid" height="100" src="https://static.wixstatic.com/media/36b510_add4e6997405440e84ad1d367633bcf0~mv2.jpg/v1/fill/w_100,h_100,al_c,q_80,usm_0.66_1.00_0.01,blur_3,enc_auto/ErdosLogoNew2023.jpg" style="width:100%;height:100%;object-fit:cover;object-position:50% 50%" width="100"/>,
 <img alt="Kari Eifler" height="100" src="https://static.wixstatic.com/media/36b510_ed151276b7404c57b30f39d8ce329126~mv2.jpg/v1/fill/w_100,h_100,al_c,q_80,usm_0.66_1.00_0.01,blur_3,enc_auto/ErdosLogoNew2023.jpg" style="width:100%;height:100%;object-fit:cover;object-position:50% 50%" width="100"/>,
 <img alt="Biplav Choudhury" height="100" src="https://static.wixstatic.com/media/36b510_150a2b19649047c190dad2adf1db3a74~mv2.jpg/v1/fill/w_100,h_100,al_c,q_80,usm_0.66_1.00_0.01,blur_3,enc_auto/ErdosLogoNew2023.jpg" style="width:100%;height:100%;object-fit:cover;object-position:50% 50%" width="100"/>,
 <img alt="Preethi Raghavan" height="100" src="https://static.wixstatic.com/media/36b510_f5eb

In [27]:
past_participants_container.find_all("img")[0].parent.parent

<a class="j7pOnl" data-testid="linkElement" href="https://business.bofa.com/content/boaml/en_us/home.html" rel="noopener" target="_blank"><wow-image class="HlRz5e BI8PVQ" data-bg-effect-name="" data-has-ssr-src="true" data-image-info='{"containerId":"comp-lr5r73fn__d441214c-80b5-4166-9c1e-b904a6d19089","displayMode":"fill","targetWidth":100,"targetHeight":100,"isLQIP":true,"imageData":{"width":200,"height":200,"uri":"36b510_add4e6997405440e84ad1d367633bcf0~mv2.jpg","name":"ErdosLogoNew2023.jpg","displayMode":"fill"}}' id="img_comp-lr5r73fn__d441214c-80b5-4166-9c1e-b904a6d19089"><img alt="Shahnawaz Khalid" height="100" src="https://static.wixstatic.com/media/36b510_add4e6997405440e84ad1d367633bcf0~mv2.jpg/v1/fill/w_100,h_100,al_c,q_80,usm_0.66_1.00_0.01,blur_3,enc_auto/ErdosLogoNew2023.jpg" style="width:100%;height:100%;object-fit:cover;object-position:50% 50%" width="100"/></wow-image></a>

Note:  we did this in a few minutes, but in real life it took me more than an hour to figure this out.  I went down many false paths to get here.

We can now extract the list of presenters:

In [28]:
# Use a list comprehension to get all of the presenter names.

presenter_names = [t["alt"] for t in past_participants_container.find_all("img")]
print('There are',len(presenter_names), 'presenters')

There are 54 presenters


We can also **try** to extract the list of url links to presenter companies:

In [29]:
# Attempt to extract all of the links using a list comprehension.

presenter_links = [t.parent.parent["href"] for t in past_participants_container.find_all("img")]

KeyError: 'href'

Oh no! What went wrong? The `KeyError` on `'href'` tells us that, for at least one image, the element two levels up has no `href` attribute, so at least one of these presenters doesn't have an associated link. Let's find out who it is using `try`/`except`.

In [None]:
for t in past_participants_container.find_all("img"):
    try:
        t.parent.parent["href"]
    except KeyError:
        print(t['alt'])

Gabriel Tucci


Only one presenter is missing a link: Gabriel Tucci. Let's process everyone else programmatically and then add that entry back in at the end.


In [None]:
presenter_names = [t["alt"] for t in past_participants_container.find_all("img") if t['alt'] != 'Gabriel Tucci']
presenter_names

['Shahnawaz Khalid',
 'Kari Eifler',
 'Biplav Choudhury',
 'Preethi Raghavan',
 'Julie Niziurski',
 'Daniel Canaday',
 'Dyas Utomo',
 'Mehmet Kaplan',
 'Christopher Dean',
 'Max Glick',
 'Samir Chowdhury',
 'Jim Kloet',
 'Jonathan Viereck',
 'Aidan Zabalo',
 'Felipe Perez',
 'Gregory Barker',
 'Joey Thompson',
 'Alex Karlovitz',
 'Gökçen Büyükbaş',
 'Tuguldur Sukhbold',
 'Lekshmi Nair',
 'Bhargava Nemmaru',
 'Nicole Torosin',
 'Benjamin Campbell',
 'Mae Markowski',
 'Alexander Izaguirre',
 'Alec Clott',
 'Zwick Tang',
 'Sean Meehan',
 'Shuvra Gupta',
 'Brooke Ogrodnik',
 'Sushant More',
 'Sarah Kessler',
 'Frank Seuffert',
 'Lei Ray Zhong',
 'Kyle Dettman',
 'Jessica Nave-Blodgett',
 'In Person Visit!',
 'Sandrine Mueller',
 'Inmaculada Sorribes',
 'Archana Anandakrishnan, PhD',
 'Olivia Walch, PhD',
 'Haile Owusu, PhD',
 'Max Ehrman, PhD',
 'Tigran Ananyan, PhD',
 'Steven Nadler, PhD',
 'Farrah Sadre-Marandi, PhD',
 'Jason Morgan, PhD',
 'Joseph Rossetti, PhD',
 'Benjamin Campbell, Ph

In [None]:
presenter_links = [t.parent.parent["href"] for t in past_participants_container.find_all("img") if t['alt'] != 'Gabriel Tucci']
presenter_links

['https://business.bofa.com/content/boaml/en_us/home.html',
 'https://news.microsoft.com/source/',
 'https://www.linkedin.com/in/biplav-choudhury-66971144/',
 'https://www.fidelity.com/',
 'https://www.pathgrowth.com/',
 'https://www.helm.ai/',
 'https://www.klarna.com/us/',
 'https://www.njit.edu/',
 'https://minedxai.com/',
 'https://www.google.com/about/careers/applications/home',
 'https://www.bearflagrobotics.com/',
 'https://reverb.com/',
 'https://www.intel.com/content/www/us/en/homepage.html',
 'https://www.prudential.com/',
 'https://signal1.ai/',
 'https://www.bms.com/',
 'https://www.sig.com/',
 'https://www.lockheedmartin.com/',
 'https://precise-soft.com/',
 'https://www.radpartners.com/',
 'https://www.usaajobs.com/',
 'https://www.linkedin.com/company/amgen/',
 'http://www.tallyhealth.com',
 'https://www.awarehq.com/',
 'https://www.mathworks.com/',
 'https://www.nychealthandhospitals.org/',
 'https://www.gartner.com/en?utm_medium=social&utm_source=linkedin&utm_campaign=

In [None]:
# Double checking to make sure we have the right number of each.
len(presenter_names), len(presenter_links)

(53, 53)

In [None]:
presenter_names += ['Gabriel Tucci']
presenter_links += ['https://www.citi.com/']

In [None]:
pd.DataFrame.from_dict({'Name': presenter_names, 'link': presenter_links})

Unnamed: 0,Name,link
0,Shahnawaz Khalid,https://business.bofa.com/content/boaml/en_us/...
1,Kari Eifler,https://news.microsoft.com/source/
2,Biplav Choudhury,https://www.linkedin.com/in/biplav-choudhury-6...
3,Preethi Raghavan,https://www.fidelity.com/
4,Julie Niziurski,https://www.pathgrowth.com/
5,Daniel Canaday,https://www.helm.ai/
6,Dyas Utomo,https://www.klarna.com/us/
7,Mehmet Kaplan,https://www.njit.edu/
8,Christopher Dean,https://minedxai.com/
9,Max Glick,https://www.google.com/about/careers/applicati...
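Special-casing one name works here, but it does not generalize if several entries lack a link. A more general sketch, shown below on hypothetical markup that only mimics the nesting of the real page, uses `find_parent("a")` to look for an enclosing link and `Tag.get` to read `href`, so that a missing link becomes `None` instead of raising a `KeyError`. The names and URL in the markup are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each image sits two levels below its wrapper,
# and only the first wrapper is a link
html = """
<div role="list">
  <a href="https://example.com/acme"><span><img alt="Ada Example"/></span></a>
  <div><span><img alt="Bo Example"/></span></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for img in soup.find_all("img"):
    # find_parent returns None when no enclosing <a> exists
    anchor = img.find_parent("a")
    # .get returns None when the <a> has no href attribute
    link = anchor.get("href") if anchor is not None else None
    rows.append((img["alt"], link))

print(rows)
```

Rows whose link is `None` can then be filled in by hand or filtered out, without interrupting the loop.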


## Common problems while web scraping

### Messy or inconsistent HTML code

We have seen one problem that you can encounter while web scraping: small, messy differences in HTML code that make automating your scraping more difficult. It is important to note that the Erdős website is actually not very messy in the grand scheme of the world wide web. For example, you can come across websites that do not label their HTML elements with `id`s, `class`es or any other kind of distinguishing metadata, which makes automation incredibly difficult. Other websites may offer no consistency from page to page. In such cases there may not be a quick or easy fix; you typically just have to hack something together and hope it works.

### Too many requests

Repeatedly sending requests to the same website can raise a flag at the site's server, after which your IP address may be blocked from receiving responses for some period of time. This is why it is good practice to space out your requests to a single website. You can do so with the `sleep` function in the `time` module, <a href="https://docs.python.org/3/library/time.html#time.sleep">https://docs.python.org/3/library/time.html#time.sleep</a>. While this decreases your risk of being flagged as a bot/scraper, it is also just being a good denizen of the internet. Sending too many requests to a single website in a short amount of time can interfere with that website's ability to function for other visitors.
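A minimal sketch of spacing out requests with `time.sleep` follows. The function name `fetch_politely`, its parameters, and the example URLs are hypothetical. The fetching step is passed in as a function so that the sketch runs without network access; an appropriate delay depends on the site.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive calls.

    Hypothetical helper for illustration: `fetch` is any function that
    takes a URL and returns the page content.
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            # Pause before every request except the first
            time.sleep(delay_seconds)
        results[url] = fetch(url)
    return results


# Stand-in fetcher so that the example runs offline. With requests one
# could instead pass: lambda url: requests.get(url, timeout=10).text
pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=0.1,
)
print(len(pages))
```

Keeping the delay in one place makes it easy to lengthen it if a site starts refusing requests.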

### Bot detection

Some websites have been set up to detect bot/scraper activity regardless of the number of requests you send. Sometimes there are ways around this, but the specific approach depends upon how the website is blocking your request. A good starting point is a web search for the specific error or response code you are getting, looking for a helpful Stack Overflow or Stack Exchange post.

### User interactive content

Some of the content on a page may be dependent on the actions of a user visiting that page. For example, there are websites where data tables do not load until the user has clicked a button or scrolled down the page.

#### `selenium`

One way to access information that requires user input is with `selenium`, <a href="https://www.selenium.dev/">https://www.selenium.dev/</a>. `selenium` installation instructions can be found here, <a href="https://pypi.org/project/selenium/">https://pypi.org/project/selenium/</a>, and documentation on how to use the package can be found here, <a href="https://selenium-python.readthedocs.io/index.html">https://selenium-python.readthedocs.io/index.html</a>.

## Summary

In this notebook we touched on how you can parse HTML code with the `bs4` package. We looked at both a simple toy example and an example from a live website. If you are interested in learning more about `bs4` I encourage you to consult its documentation, <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">https://www.crummy.com/software/BeautifulSoup/bs4/doc/</a>.

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph. D., 2023.  Modified by Steven Gubkin 2024.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).