#### Beautiful Soup Web Scraping Tutorial

Followed the tutorial [here](https://www.youtube.com/watch?v=GjKQ6V_ViQE&ab_channel=KeithGalli) by Keith Galli <br>
BS documentation [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) <br>
CSS Selector Reference [here](https://www.w3schools.com/cssref/css_selectors.asp)

In [2]:
import requests 
from bs4 import BeautifulSoup as bs

In [6]:
# Load WebPage
r = requests.get("https://keithgalli.github.io/web-scraping/example.html")

# Convert to a bs object; naming the parser explicitly avoids the "no parser specified" warning
soup = bs(r.content, "html.parser")

#print(soup.prettify())
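As a rough offline sketch of the same loading step, the block below parses an inline HTML string instead of a downloaded page. The `html` string is a made-up stand-in for `r.content`, not the real page. With a live request it is probably also worth calling `r.raise_for_status()` before parsing, so an error page is not scraped by accident.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.content so the sketch runs without network access
html = "<html><body><h1>HTML Webpage</h1><p>Some text</p></body></html>"

# Passing the parser name keeps the result the same on every machine
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.string)
print(soup.prettify())
```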

### find & find all

In [13]:
first_header = soup.find('h2')
headers = soup.find_all('h2')

In [18]:
# Pass in a list of tag names to look for
# find returns the first element that matches any name in the list
first_header = soup.find(["h1", "h2"])
headers = soup.find_all(["h1", "h2"])
headers


[<h1>HTML Webpage</h1>, <h2>A Header</h2>, <h2>Another header</h2>]
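A small sketch of how the two calls differ in what they return, using an inline HTML string (an assumption, modelled loosely on the example page). `find` gives back a single tag or `None`, while `find_all` always gives back a list-like result, which may be empty.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not the real page
html = "<h1>HTML Webpage</h1><h2>A Header</h2><h2>Another header</h2>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("h2")            # a single Tag
all_h2 = soup.find_all("h2")       # a list-like ResultSet of Tags
missing = soup.find("h3")          # None when nothing matches
none_found = soup.find_all("h3")   # empty when nothing matches

# limit caps how many matches find_all collects
first_two = soup.find_all(["h1", "h2"], limit=2)

print(first)
print(len(all_h2))
print(missing)
print(len(none_found))
print(first_two)
```

Because `find` can return `None`, it is safer to check the result before chaining another call onto it.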

In [23]:
# pass in attributes to the find/find_all function
# the code below finds the paragraph with a specific id
paragraph = soup.find_all("p", attrs={"id": "paragraph-id"})
paragraph

[<p id="paragraph-id"><b>Some bold text</b></p>]
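Besides the `attrs` dictionary, Beautiful Soup also accepts attributes as keyword arguments, with `class_` standing in for `class` because `class` is a reserved word in Python. The sketch below compares the forms on a made-up snippet; the `note` and `highlight` class names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the class names are invented
html = (
    '<p class="note">First</p>'
    '<p id="paragraph-id" class="note highlight"><b>Some bold text</b></p>'
)
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find_all("p", attrs={"id": "paragraph-id"})
by_keyword = soup.find_all("p", id="paragraph-id")   # same search as above
by_class = soup.find_all("p", class_="note")         # matches any tag that has this class

print(by_attrs)
print(by_keyword)
print(len(by_class))
```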

In [28]:
# find/find_all calls can be nested
# The code below first saves the entire body, then calls find on the body variable to get only its first div
body = soup.find('body')
div = body.find('div')
div

<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>

In [35]:
# Search for specific strings in find/find_all calls

# A plain string only matches elements whose text is exactly that string
specific_string = soup.find_all("p", string="Some bold text")

# Combine it with regex to find headers whose text contains a certain pattern
import re
# re.compile builds a re.Pattern object
headers = soup.find_all("h2", string=re.compile("(H|h)eader"))
headers

[<h2>A Header</h2>, <h2>Another header</h2>]
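To make the exact-match versus pattern-match difference concrete, here is a minimal sketch on a made-up list of headers (the `Footer` header is invented). A plain string must equal the whole text, while a compiled pattern only needs to be found somewhere in it.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet; "Footer" is added only to show a non-match
html = "<h2>A Header</h2><h2>Another header</h2><h2>Footer</h2>"
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all("h2", string="A Header")    # whole text must be equal
partial = soup.find_all("h2", string="Header")    # no h2 has exactly this text
pattern = soup.find_all("h2", string=re.compile("header", re.IGNORECASE))

print(exact)
print(partial)
print(pattern)
```

Using `re.IGNORECASE` is one alternative to writing `(H|h)eader` by hand.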

### selector (CSS Selector)

CSS selectors are most useful when the elements you want sit along a specific path in the document, such as a tag nested inside another tag.

In [90]:
#print(soup.body.prettify())

In [43]:
content = soup.select('p')
print(content)
# select the h1 inside div
content = soup.select('div h1')
print(content)
# select every paragraph that follows an h2 at the same level (general sibling combinator); 'h2 + p' would select only the paragraph directly after an h2
paragraphs = soup.select('h2 ~ p')
print(paragraphs)
# select bold text within a paragraph that has a specific id
bold_text = soup.select('p#paragraph-id b')
print(bold_text)


[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>, <p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
[<h1>HTML Webpage</h1>]
[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
[<b>Some bold text</b>]
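The combinators are easy to mix up, so the sketch below runs the four common ones side by side on a made-up snippet (the paragraph texts are invented so each match is easy to identify).

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not the real page
html = """
<body>
<h2>A Header</h2>
<p>first</p>
<p>second</p>
<div><p>nested</p></div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

adjacent = soup.select("h2 + p")      # only the p immediately after an h2
general = soup.select("h2 ~ p")       # every later p that is a sibling of an h2
children = soup.select("body > p")    # p elements that are direct children of body
descendants = soup.select("body p")   # p elements anywhere inside body

for name, result in [("h2 + p", adjacent), ("h2 ~ p", general),
                     ("body > p", children), ("body p", descendants)]:
    print(name, [p.get_text() for p in result])
```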


In [45]:
# Nested calls
# select paragraphs that are direct children of the body
paragraphs = soup.select("body > p")
print(paragraphs)

for paragraph in paragraphs:
    print(paragraph.select("i"))

[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
[<i>Some italicized text</i>]
[]


In [46]:
# Grab elements by a specific attribute value
soup.select("[align=middle]")

[<div align="middle">
 <h1>HTML Webpage</h1>
 <p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
 </div>]

### Get different properties of the HTML

string, get_text, link


In [48]:
header = soup.find('h2')
print(header.string)

# If an element has multiple child elements, use get_text
# .string doesn't work here because the div holds both an h1 and a p, so bs can't tell which string to return and gives None
div = soup.find("div")
#print(div.prettify())
print(div.get_text())


A Header

HTML Webpage
Link to more interesting example: keithgalli.github.io/web-scraping/webpage.html
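A short sketch of the `.string` versus `get_text` behaviour on a made-up div. The `separator` and `strip` arguments of `get_text` are optional and used here only to tidy the combined text.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, loosely modelled on the div in the example page
html = '<div><h1>HTML Webpage</h1><p>Link: <a href="#">example</a></p></div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div")

print(soup.h1.string)   # a single text child, so .string works
print(div.string)       # several children, so .string is None
print(soup.p.string)    # text plus a link, so also None

# get_text gathers all nested text; strip trims each piece, separator joins them
text = div.get_text(separator=" ", strip=True)
print(text)
```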



In [56]:
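# Get a link: attributes are read with dictionary-style indexing
link = soup.find("a")
link['href']  # not displayed, since a cell only echoes its last expression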

paragraphs = soup.select('p#paragraph-id')
print(paragraphs)
paragraphs[0]['id']

[<p id="paragraph-id"><b>Some bold text</b></p>]


'paragraph-id'
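Indexing with `tag['href']` raises a `KeyError` when the attribute is absent, so for messy pages `tag.get(...)` may be the safer choice. The sketch below uses a made-up pair of links, one of them without an `href`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the URL and class names are invented
html = (
    '<a href="https://example.com/page" class="external link">Example</a>'
    '<a>No link</a>'
)
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

print(first["href"])             # dictionary-style access
print(first.get("href"))         # same value via get
print(second.get("href"))        # None instead of a KeyError
print(second.get("href", ""))    # or a default of your choice
print(first["class"])            # class is multi-valued, so it comes back as a list
print(first.attrs)               # all attributes as a dictionary

try:
    second["href"]
except KeyError:
    print("second link has no href")
```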

### Code Navigation

path syntax
terms: parent, sibling, child


In [57]:
# path syntax
soup.body.div.h1.string

'HTML Webpage'

In [63]:
# Parent, sibling, child
soup.body.h2.find_next_siblings()

[<p><i>Some italicized text</i></p>,
 <h2>Another header</h2>,
 <p id="paragraph-id"><b>Some bold text</b></p>]
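The parent, sibling, and child terms map onto a handful of attributes and methods. Below is a minimal sketch on a made-up single-line snippet. It is written on one line on purpose, since whitespace between tags would otherwise show up as extra text nodes among the children.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not the real page
html = (
    "<body><h2>A Header</h2><p><i>Some italicized text</i></p>"
    "<h2>Another header</h2><p>Last</p></body>"
)
soup = BeautifulSoup(html, "html.parser")

h2 = soup.body.h2                      # path syntax returns the first match at each step

print(h2.parent.name)                  # parent: the tag that contains h2
print([child.name for child in soup.body.children])   # children: direct contents of body
print(h2.find_next_sibling("p"))       # sibling: the next p at the same level
print(h2.find_next_siblings("h2"))     # all later h2 siblings
print(soup.i.find_parent("p"))         # walk upwards to the enclosing p
```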

### Exercises

webpage: https://keithgalli.github.io/web-scraping/webpage.html


In [74]:
r = requests.get("https://keithgalli.github.io/web-scraping/webpage.html")

webpage = bs(r.content, "html.parser")
#print(webpage.prettify())

##### Grab all of the social weblinks from the page

In [103]:
# method 1
links_with_tags = webpage.select("ul.socials a")

links = [link["href"] for link in links_with_tags]
links

['https://www.instagram.com/keithgalli/',
 'https://twitter.com/keithgalli',
 'https://www.linkedin.com/in/keithgalli/',
 'https://www.tiktok.com/@keithgalli']

In [107]:
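# method 2
ul = webpage.find("ul", attrs={"class": "socials"})
# search inside the ul found above instead of reusing the result from method 1
links_with_tags = ul.find_all("a")
links = [link["href"] for link in links_with_tags]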
links

['https://www.instagram.com/keithgalli/',
 'https://twitter.com/keithgalli',
 'https://www.linkedin.com/in/keithgalli/',
 'https://www.tiktok.com/@keithgalli']

In [118]:
# method 3
links_with_tags = webpage.select("li.social a")
links = [link["href"] for link in links_with_tags]
# show the matched tags themselves
links_with_tags


[<a href="https://www.instagram.com/keithgalli/">https://www.instagram.com/keithgalli/</a>,
 <a href="https://twitter.com/keithgalli">https://twitter.com/keithgalli</a>,
 <a href="https://www.linkedin.com/in/keithgalli/">https://www.linkedin.com/in/keithgalli/</a>,
 <a href="https://www.tiktok.com/@keithgalli">https://www.tiktok.com/@keithgalli</a>]

#### explanation for method 3

Here, the class attribute of each social link's list item holds two class names separated by a space, for example class = "social instagram". This means the Instagram list item has two classes: social and instagram. In a CSS selector, either class name works with the . operator.

In other words, to select the Instagram list item, we can use either ```webpage.select("li.instagram")``` or ```webpage.select("li.social.instagram")```
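The multi-class behaviour can be checked offline with a small sketch. The markup below is an assumed approximation of the socials list: only `class="social instagram"` and the `socials` list class come from the notes above, and the `twitter` class name is a guess for illustration.

```python
from bs4 import BeautifulSoup

# Assumed approximation of the page's socials list, not copied from the real page
html = """
<ul class="socials">
  <li class="social instagram"><a href="https://www.instagram.com/keithgalli/">Instagram</a></li>
  <li class="social twitter"><a href="https://twitter.com/keithgalli">Twitter</a></li>
</ul>
"""
webpage = BeautifulSoup(html, "html.parser")

all_social = webpage.select("li.social")              # every item that has the class social
by_one = webpage.select("li.instagram")               # one class name is enough
by_both = webpage.select("li.social.instagram")       # chaining requires both classes
by_find = webpage.find_all("li", class_="instagram")  # find_all also matches a single class

print(len(all_social))
print(by_one[0].a["href"])
print(by_one == by_both)
print(by_find[0]["class"])
```

Chaining classes such as `li.social.instagram` is stricter than a single class, which can help when one class name is reused elsewhere on a page.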