### Comprehensive Python Beautiful Soup Web Scraping Tutorial! (find/find_all, css select, scrape table)
https://www.youtube.com/watch?v=GjKQ6V_ViQE

https://github.com/KeithGalli/web-scraping/blob/master/web_scraping_tutorial.ipynb

https://beautiful-soup-4.readthedocs.io/en/latest/#

Additional Video:
https://www.youtube.com/watch?v=Ewgy-G9cmbg

### Load the necessary libs

In [1]:

import requests
from bs4 import BeautifulSoup as bs

### Load our first page

In [3]:
# load the webpage content
r = requests.get("https://keithgalli.github.io/web-scraping/example.html")

# Convert to a beautiful soup object
soup = bs(r.content, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning

# print the html
print(soup.prettify())

<html>
 <head>
  <title>
   HTML Example
  </title>
 </head>
 <body>
  <div align="middle">
   <h1>
    HTML Webpage
   </h1>
   <p>
    Link to more interesting example:
    <a href="https://keithgalli.github.io/web-scraping/webpage.html">
     keithgalli.github.io/web-scraping/webpage.html
    </a>
   </p>
  </div>
  <h2>
   A Header
  </h2>
  <p>
   <i>
    Some italicized text
   </i>
  </p>
  <h2>
   Another header
  </h2>
  <p id="paragraph-id">
   <b>
    Some bold text
   </b>
  </p>
 </body>
</html>
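
A minimal sketch of a more defensive loader, assuming the standard-library `html.parser` is good enough for these pages. `parse_html` and `fetch_soup` are hypothetical helper names, not part of the tutorial, and the demo parses an inline string so it runs without network access:

```python
import requests
from bs4 import BeautifulSoup


def parse_html(html):
    """Parse markup with the built-in parser so results do not depend on lxml or html5lib."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url, timeout=10):
    """Download a page and parse it; raises requests.HTTPError on 4xx/5xx responses."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return parse_html(response.content)


# Offline demonstration on a small inline document (hypothetical markup).
demo = parse_html(
    "<html><head><title>HTML Example</title></head>"
    "<body><h1>HTML Webpage</h1></body></html>"
)
print(demo.title.string)
print(demo.h1.get_text())
```

Calling `fetch_soup("https://keithgalli.github.io/web-scraping/example.html")` would give the same kind of object as `soup` above, provided the request succeeds.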



### Scraping using BeautifulSoup lib

In [4]:
soup

<html>
<head>
<title>HTML Example</title>
</head>
<body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
</body>
</html>

In [5]:
first_header = soup.find("h2") # finds the first element that matches the tag
first_header

<h2>A Header</h2>

In [6]:
headers = soup.find_all("h2") # finds all the elements that match the tag
headers

[<h2>A Header</h2>, <h2>Another header</h2>]

In [8]:
# Pass in a list of elements to look for

first_header_h1_or_h2 = soup.find(["h1", "h2"])
first_header_h1_or_h2

<h1>HTML Webpage</h1>

In [10]:
# The order of the tags in the list does not change the result

first_header_h1_or_h2 = soup.find(["h2", "h1"]) # order does not matter
first_header_h1_or_h2

## It returns the first element in the document that matches any tag in the list

<h1>HTML Webpage</h1>

In [11]:
headers_h1_or_h2 = soup.find_all(["h2", "h1"]) # order does not matter. Gets all matching elements.
headers_h1_or_h2

[<h1>HTML Webpage</h1>, <h2>A Header</h2>, <h2>Another header</h2>]
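
A small sketch of how `find` and `find_all` relate, run on an inline copy of the headers (the markup below is a trimmed stand-in for the example page). It also shows what each call returns when nothing matches, which is worth knowing before chaining calls:

```python
from bs4 import BeautifulSoup

html = "<body><h1>HTML Webpage</h1><h2>A Header</h2><h2>Another header</h2></body>"
soup = BeautifulSoup(html, "html.parser")

# find() returns the earliest match in the document, whatever the list order.
first_a = soup.find(["h1", "h2"])
first_b = soup.find(["h2", "h1"])
print(first_a.name, first_b.name)

# limit= stops find_all() after that many matches.
limited = soup.find_all(["h2", "h1"], limit=2)
print([tag.name for tag in limited])

# No match: find() gives None, find_all() gives an empty list.
missing_one = soup.find("h3")
missing_all = soup.find_all("h3")
print(missing_one, missing_all)
```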

In [12]:
# find/find_all can also filter by attribute; first, grab every paragraph

para = soup.find_all("p")
para

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>,
 <p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [14]:
# Additional filtering using a specific attribute

para = soup.find_all("p", attrs = {"id": "paragraph-id"}) 
para

[<p id="paragraph-id"><b>Some bold text</b></p>]
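
The `attrs` dictionary is one of several ways to filter by attribute. A short sketch of the keyword forms, using hypothetical inline markup (the `class` values are made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<p class="intro lead">First</p>
<p id="paragraph-id"><b>Some bold text</b></p>
<p class="lead">Third</p>
"""
soup = BeautifulSoup(html, "html.parser")

# These two calls are equivalent.
by_attrs = soup.find_all("p", attrs={"id": "paragraph-id"})
by_keyword = soup.find_all("p", id="paragraph-id")

# class is a reserved word in Python, so the keyword is class_.
# It matches any element that has "lead" among its classes.
by_class = soup.find_all("p", class_="lead")
print([p.get_text() for p in by_class])
```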

In [17]:
# You can nest find/find_all calls

body = soup.find('body')
print(body)
print()

div = body.find('div')
print(div)
print()

header = div.find('h1')
print(header)
print()

<body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
</body>

<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>

<h1>HTML Webpage</h1>



In [19]:
# We can search for specific strings in our find/find_all calls

## find any paragraph whose text is exactly 'Some bold text'

paragraphs = soup.find_all("p", string = "Some bold text")  # A plain string must match the full text exactly, so it is rarely useful on its own.
paragraphs

[<p id="paragraph-id"><b>Some bold text</b></p>]

In [21]:
import re   # Regular expressions

paragraphs = soup.find_all("p", string = re.compile("Some"))  # A compiled regex lets us match on part of the string
paragraphs

[<p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [23]:
headers = soup.find_all('h2', string = re.compile("(H|h)eader")) # matches "Header" as well as "header"
headers 

[<h2>A Header</h2>, <h2>Another header</h2>]
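
One caveat, sketched below on hypothetical inline markup: when a tag name is given, the `string` filter is compared against the tag's `.string`, which is `None` as soon as the tag has more than one child. A function filter based on `get_text()` is one possible workaround:

```python
import re
from bs4 import BeautifulSoup

html = """
<p>Link to more interesting example: <a href="#">a link</a></p>
<p><i>Some italicized text</i></p>
"""
soup = BeautifulSoup(html, "html.parser")

# The first <p> holds text plus an <a>, so its .string is None
# and the string= filter cannot match it.
by_string = soup.find_all("p", string=re.compile("Link"))

# A function filter that inspects get_text() sees all nested text.
by_function = soup.find_all(lambda tag: tag.name == "p" and "Link" in tag.get_text())

# The second <p> has a single child, so .string resolves to the inner text.
italic = soup.find_all("p", string=re.compile("italicized"))

print(len(by_string), len(by_function), len(italic))
```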

### Select method - CSS selector
https://www.w3schools.com/cssref/css_selectors.asp

In [26]:
print(soup.body.prettify())

<body>
 <div align="middle">
  <h1>
   HTML Webpage
  </h1>
  <p>
   Link to more interesting example:
   <a href="https://keithgalli.github.io/web-scraping/webpage.html">
    keithgalli.github.io/web-scraping/webpage.html
   </a>
  </p>
 </div>
 <h2>
  A Header
 </h2>
 <p>
  <i>
   Some italicized text
  </i>
 </p>
 <h2>
  Another header
 </h2>
 <p id="paragraph-id">
  <b>
   Some bold text
  </b>
 </p>
</body>



In [24]:
content = soup.select("p")
content

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>,
 <p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [27]:
content = soup.select("div p")   # select p contained in div
content

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>]

In [28]:
content = soup.select("body p")  # select p contained in body
content

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>,
 <p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [30]:
paragraphs = soup.select("h2 ~ p") # paragraphs that come after an h2 and share its parent (general sibling combinator)
paragraphs

[<p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]
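
For comparison, a sketch of the two sibling combinators side by side. The markup is a hypothetical variation of the example page with one extra paragraph, so that `~` and `+` give different results:

```python
from bs4 import BeautifulSoup

html = (
    "<body><div><p>in div</p></div>"
    "<h2>A Header</h2><p>one</p>"
    "<h2>Another header</h2><p>two</p><p>three</p></body>"
)
soup = BeautifulSoup(html, "html.parser")

# "h2 ~ p": every <p> that follows an <h2> under the same parent.
general = [p.get_text() for p in soup.select("h2 ~ p")]

# "h2 + p": only a <p> that comes immediately after an <h2>.
adjacent = [p.get_text() for p in soup.select("h2 + p")]

print(general)
print(adjacent)
```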

In [32]:
bold_text = soup.select("p#paragraph-id b")   # b elements inside the p whose id is paragraph-id
bold_text

[<b>Some bold text</b>]

In [33]:
paragraphs = soup.select("body > p") # paragraphs that are direct children of body
paragraphs

[<p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [36]:
for para in paragraphs:
    print(para.select("i"))

[<i>Some italicized text</i>]
[]


In [37]:
# Grab elements by a specific attribute value
soup.select("[align = middle]")

[<div align="middle">
 <h1>HTML Webpage</h1>
 <p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
 </div>]
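
Two related tools that may be handy here, shown as a sketch on hypothetical inline markup: `select_one`, which returns a single element (or `None`) instead of a list, and attribute selectors that match on part of a value:

```python
from bs4 import BeautifulSoup

html = """
<div align="middle">
  <a href="https://example.com/a">A</a>
  <a href="/local">B</a>
</div>
<p id="paragraph-id"><b>Some bold text</b></p>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one() is to select() what find() is to find_all().
bold = soup.select_one("p#paragraph-id b")
no_table = soup.select_one("table")

# [attr^=value] matches attribute values that start with the given prefix.
absolute_links = soup.select('a[href^="https"]')

# Quoting the value is the safest way to write an attribute selector.
centered = soup.select('[align="middle"]')

print(bold.get_text(), no_table, len(absolute_links), centered[0].name)
```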

### Get different properties of the HTML

In [38]:
header = soup.find("h2")
header

<h2>A Header</h2>

In [39]:
header.string

'A Header'

In [40]:
header.text

'A Header'

In [41]:
header.get_text  # without parentheses this is the method itself; call header.get_text() to get the text

<bound method PageElement.get_text of <h2>A Header</h2>>

In [43]:
# If multiple child elements are present, use get_text() (.string returns None in that case)

div = soup.find("div")
print(div.prettify())
print(div.get_text())

<div align="middle">
 <h1>
  HTML Webpage
 </h1>
 <p>
  Link to more interesting example:
  <a href="https://keithgalli.github.io/web-scraping/webpage.html">
   keithgalli.github.io/web-scraping/webpage.html
  </a>
 </p>
</div>


HTML Webpage
Link to more interesting example: keithgalli.github.io/web-scraping/webpage.html
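
A sketch of the text accessors on a tag with several children, using a shortened, hypothetical version of the `div` above. The `separator` and `strip` arguments of `get_text()` help when the raw text runs together or carries stray whitespace:

```python
from bs4 import BeautifulSoup

html = '<div><h1>HTML Webpage</h1><p>Link: <a href="#">example</a></p></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

# .string is None because the div has more than one child.
print(div.string)

# get_text() concatenates every nested string as-is.
raw = div.get_text()

# separator/strip clean up each piece before joining.
clean = div.get_text(separator=" ", strip=True)

# stripped_strings yields the same pieces one by one.
pieces = list(div.stripped_strings)

print(raw)
print(clean)
print(pieces)
```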



In [46]:
# Get a specific attribute from an element

link = soup.find("a")
print(link)
print(link['href'])

<a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a>
https://keithgalli.github.io/web-scraping/webpage.html
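
Square-bracket access raises `KeyError` when the attribute is missing. A minimal sketch of the safer alternatives, on a hypothetical link that also carries a `class` attribute:

```python
from bs4 import BeautifulSoup

html = '<a class="btn primary" href="https://keithgalli.github.io/web-scraping/webpage.html">link</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

href = link["href"]                       # raises KeyError if the attribute is absent
target = link.get("target")               # returns None instead of raising
target_or_default = link.get("target", "_self")
classes = link["class"]                   # multi-valued attributes come back as a list
all_attrs = link.attrs                    # plain dict of every attribute

print(href)
print(target, target_or_default)
print(classes)
print(sorted(all_attrs))
```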


In [50]:
paragraphs = soup.select("p#paragraph-id")
print(paragraphs[0])
print()
print(paragraphs[0]['id'])

<p id="paragraph-id"><b>Some bold text</b></p>

paragraph-id


In [53]:
## Path syntax

soup.body.div.h1.string

'HTML Webpage'

In [52]:
soup.body.div.h1.text

'HTML Webpage'
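
The dotted syntax is shorthand for `find()`, so it returns the first match only and `None` when a tag is absent. A sketch of guarding against that, where `text_or_default` is a hypothetical helper and the markup is a trimmed stand-in for the example page:

```python
from bs4 import BeautifulSoup

html = "<body><div><h1>HTML Webpage</h1></div><h2>A Header</h2><h2>Another header</h2></body>"
soup = BeautifulSoup(html, "html.parser")

first_h2 = soup.body.h2   # same element as soup.body.find("h2")
missing = soup.body.h3    # None: there is no <h3>, so missing.string would raise AttributeError


def text_or_default(tag, default=""):
    """Return the tag's stripped text, or a default when the lookup found nothing."""
    return tag.get_text(strip=True) if tag is not None else default


print(text_or_default(soup.body.div.h1))
print(text_or_default(missing, "n/a"))
```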

In [54]:
# Know the terms: Parent, Sibling, Child
## https://beautiful-soup-4.readthedocs.io/en/latest/#find-next-siblings-and-find-next-sibling
soup.body.find("div")

<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>

In [55]:
soup.body.find("div").find_next_siblings()

[<h2>A Header</h2>,
 <p><i>Some italicized text</i></p>,
 <h2>Another header</h2>,
 <p id="paragraph-id"><b>Some bold text</b></p>]
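
A few more navigation calls in the same family, sketched on hypothetical inline markup shaped like the example page. Note that the "previous" variants return elements nearest-first, which is the reverse of document order:

```python
from bs4 import BeautifulSoup

html = (
    "<body><div><h1>HTML Webpage</h1></div>"
    "<h2>A Header</h2><p>one</p>"
    "<h2>Another header</h2><p>two</p></body>"
)
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")
last_p = soup.find_all("p")[-1]

next_tag = div.find_next_sibling()                # first sibling after the div
next_paragraphs = div.find_next_siblings("p")     # later siblings, filtered by tag name
nearest_header = last_p.find_previous_sibling("h2")
before_last = [tag.name for tag in last_p.find_previous_siblings()]
parent_name = soup.h1.parent.name                 # walk up instead of sideways

print(next_tag.get_text())
print([p.get_text() for p in next_paragraphs])
print(nearest_header.get_text())
print(before_last)
print(parent_name)
```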

## Exercises

https://keithgalli.github.io/web-scraping/webpage.html

In [56]:
# Load the webpage content
r = requests.get("https://keithgalli.github.io/web-scraping/webpage.html")

# Convert to a beautiful soup object
webpage = bs(r.content, "html.parser")  # name the parser explicitly

# Print out our html
print(webpage.prettify())

<html>
 <head>
  <title>
   Keith Galli's Page
  </title>
  <style>
   table {
    border-collapse: collapse;
  }
  th {
    padding:5px;
  }
  td {
    border: 1px solid #ddd;
    padding: 5px;
  }
  tr:nth-child(even) {
    background-color: #f2f2f2;
  }
  th {
    padding-top: 12px;
    padding-bottom: 12px;
    text-align: left;
    background-color: #add8e6;
    color: black;
  }
  .block {
  width: 100px;
  /*float: left;*/
    display: inline-block;
    zoom: 1;
  }
  .column {
  float: left;
  height: 200px;
  /*width: 33.33%;*/
  padding: 5px;
  }

  .row::after {
    content: "";
    clear: both;
    display: table;
  }
  </style>
 </head>
 <body>
  <h1>
   Welcome to my page!
  </h1>
  <img src="./images/selfie1.jpg" width="300px"/>
  <h2>
   About me
  </h2>
  <p>
   Hi, my name is Keith and I am a YouTuber who focuses on content related to programming, data science, and machine learning!
  </p>
  <p>
   Here is a link to my channel:
   <a href="https://www.youtube.com/kgmi

### Exercise #1: Grab all of the social links from the webpage

In [57]:
links = webpage.select("ul.socials a")
actual_links = [link['href'] for link in links]
actual_links

['https://www.instagram.com/keithgalli/',
 'https://twitter.com/keithgalli',
 'https://www.linkedin.com/in/keithgalli/',
 'https://www.tiktok.com/@keithgalli']

In [58]:
ulist = webpage.find("ul", attrs={"class": "socials"})
links = ulist.find_all("a")
actual_links = [link['href'] for link in links]
actual_links

['https://www.instagram.com/keithgalli/',
 'https://twitter.com/keithgalli',
 'https://www.linkedin.com/in/keithgalli/',
 'https://www.tiktok.com/@keithgalli']

In [59]:
links = webpage.select("li.social a")
actual_links = [link['href'] for link in links]
actual_links

['https://www.instagram.com/keithgalli/',
 'https://twitter.com/keithgalli',
 'https://www.linkedin.com/in/keithgalli/',
 'https://www.tiktok.com/@keithgalli']

In [60]:
links = webpage.select("body ul li.social a")
links

[<a href="https://www.instagram.com/keithgalli/">https://www.instagram.com/keithgalli/</a>,
 <a href="https://twitter.com/keithgalli">https://twitter.com/keithgalli</a>,
 <a href="https://www.linkedin.com/in/keithgalli/">https://www.linkedin.com/in/keithgalli/</a>,
 <a href="https://www.tiktok.com/@keithgalli">https://www.tiktok.com/@keithgalli</a>]

In [61]:
links = webpage.select("li.social a")
links

[<a href="https://www.instagram.com/keithgalli/">https://www.instagram.com/keithgalli/</a>,
 <a href="https://twitter.com/keithgalli">https://twitter.com/keithgalli</a>,
 <a href="https://www.linkedin.com/in/keithgalli/">https://www.linkedin.com/in/keithgalli/</a>,
 <a href="https://www.tiktok.com/@keithgalli">https://www.tiktok.com/@keithgalli</a>]
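
The three approaches above can be checked against each other offline. The markup below is an assumed stand-in for the socials list (the real page source is truncated in the output above), kept just detailed enough for the `ul.socials` and `li.social` selectors to apply:

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the selectors used in the exercise.
html = """
<ul class="socials">
  <li class="social"><a href="https://www.instagram.com/keithgalli/">Instagram</a></li>
  <li class="social"><a href="https://twitter.com/keithgalli">Twitter</a></li>
</ul>
<ul class="other">
  <li><a href="https://example.com/">Unrelated link</a></li>
</ul>
"""
page = BeautifulSoup(html, "html.parser")

# 1. CSS selector anchored on the list.
via_ul = [a["href"] for a in page.select("ul.socials a")]

# 2. find() the list, then find_all() the links inside it.
ulist = page.find("ul", class_="socials")
via_find = [a["href"] for a in ulist.find_all("a")]

# 3. CSS selector anchored on the list items.
via_li = [a["href"] for a in page.select("li.social a")]

print(via_ul)
print(via_ul == via_find == via_li)
```

Scoping the search to the list first (approach 2) is the easiest one to debug, since `ulist` can be inspected on its own before the links are extracted.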

### Exercise #2: Grab all text on the webpage

In [63]:
# Only collect the text above the "Photos" heading

header = webpage.body.find("h2", string="Photos")
previous_elements = header.find_previous_siblings()
previous_elements_sorted = previous_elements[::-1]  # find_previous_siblings() returns nearest-first; reverse into document order
elements = [x.get_text() for x in previous_elements_sorted]
text = "\n".join(elements)
print(text)

Welcome to my page!

About me
Hi, my name is Keith and I am a YouTuber who focuses on content related to programming, data science, and machine learning!
Here is a link to my channel: youtube.com/kgmit
I grew up in the great state of New Hampshire here in the USA. From an early age I always loved math. Around my senior year of high school, my brother first introduced me to programming. I found it a creative way to apply the same type of logical thinking skills that I enjoyed with math. This influenced me to study computer science in college and ultimately create a YouTube channel to share some things that I have learned along the way.
Hobbies
Believe it or not, I don't code 24/7. I love doing all sorts of active things. I like to play ice hockey & table tennis as well as run, hike, skateboard, and snowboard. In addition to sports, I am a board game enthusiast. The two that I've been playing the most recently are Settlers of Catan and Othello.
Fun Facts

Owned my dream car in high schoo