In [82]:
import requests
from bs4 import BeautifulSoup

## Web Scraping with BS4

BS4 (Beautiful Soup 4) is an older library, and its docs are not as extensive or well organized as those of larger libraries like pandas, but a mixture of introspection and the docs will give you a pretty good working knowledge of BS4! It is still a robust and essential library for web scrapers, even if it doesn't have as many features as a very large library like pandas.

**Devpost**<br>
The first example comes from a towardsdatascience.com tutorial, but the page no longer returns all of the HTML elements the tutorial relies on, even with an alternate parser like html.parser. Arbitrary scraping examples have been substituted to replace it.

Tutorial [here](https://towardsdatascience.com/how-to-scrape-any-website-with-python-and-beautiful-soup-bc84e95a3483)<br>
Site to scrape [here](https://devpost.com/hackathons?page=2&search=blockchain)

In [83]:
#Obtain html source and parse with beautifulsoup
result = requests.get(
    "https://devpost.com/hackathons?page=2&search=blockchain")
soup = BeautifulSoup(result.content, 'lxml')

In [84]:
#HTML parsed by lxml is no longer consistent with what the code on towardsdatascience.com is looking for
#print(soup.body.prettify()) #Uncomment to see many divs from the original page are broken or missing
featured_challenges = soup.find_all('a', attrs={'data-role': 'featured_challenge'})
featured_challenges #Returns no results

[]

It would seem that DevPost has either changed its site structure and introduced dynamic elements that never reach the lxml parser (and that contain the data the towardsdatascience tutorial is looking for), or changed the names and attributes of its tags so that the naming convention is now different.
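
A rough way to narrow this down is to check whether the attribute value appears anywhere in the raw response. The sketch below runs against a small inline HTML string rather than the live DevPost page; `STATIC_HTML` and `diagnose` are hypothetical names made up for illustration. If the value is missing from the raw text, the markup was never in the response (rendered later by JavaScript or renamed); if it is present but `find_all` still returns nothing, the query or parser is the more likely culprit.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a page whose listings are filled in by JavaScript
STATIC_HTML = """
<html><body>
  <div id="app"></div>
  <script src="/assets/app.js"></script>
  <a href="/about">About</a>
</body></html>
"""

def diagnose(html, name, attrs):
    """Summarize why a find_all query might come back empty."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.find_all(name, attrs=attrs)
    return {
        "matches": len(matches),
        # Many script tags and an empty container hint at client-side rendering
        "script_tags": len(soup.find_all("script")),
        # Is the attribute value anywhere in the raw text we downloaded?
        "attr_in_raw_html": any(str(v) in html for v in attrs.values()),
    }

report = diagnose(STATIC_HTML, "a", {"data-role": "featured_challenge"})
print(report)
```

With a real page, `html` would be `result.text` from the `requests.get` call above.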

In [85]:
#Obtain all links in the page and print them out, note the original links for the hackathons we wanted are not listed here
for hyperlink in soup.body.find_all('a'): #Return iterable collection of 'a' tags, or hyperlinks
    link = hyperlink.get('href')          #Find the actual link in href attr
    if link is not None:                  #Print link, but some 'a' tags have no href attr, so avoid printing "None" out
        print(link)

#
https://devpost.com
https://secure.devpost.com/users/login?ref=top-nav-login
https://secure.devpost.com/users/register?ref_content=signup_global_nav&ref_feature=signup&ref_medium=button
https://devpost.com/hackathons
https://devpost.com/software
https://post.devpost.com
https://devpost.com
https://devpost.com/hackathons
https://devpost.com/software
https://post.devpost.com
https://secure.devpost.com/users/login?ref=top-nav-login
https://secure.devpost.com/users/register?ref_content=signup_global_nav&ref_feature=signup&ref_medium=button
https://info.devpost.com/about
https://info.devpost.com/careers
https://info.devpost.com/contact
https://help.devpost.com/
https://devpost.com/hackathons
https://devpost.com/software
https://post.devpost.com
https://post.devpost.com/app_contest_resources/
https://devpost.com/portfolio/redirect?page=projects
https://devpost.com/portfolio/redirect?page=hackathons
https://devpost.com/settings
https://twitter.com/devpost
https://www.facebook.com/devposthac
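
The output above repeats several links and includes a bare `#` anchor. A minimal sketch of one way to tidy that up is shown below, again using an inline HTML sample instead of the live page; `SAMPLE_HTML` and `unique_links` are hypothetical names. It relies on `find_all('a', href=True)` to skip tags without an `href`, and on `urllib.parse.urljoin` to turn relative links into absolute ones.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://devpost.com/hackathons"  # page the links were scraped from

# Hypothetical sample with a duplicate, a relative link, and a tag without href
SAMPLE_HTML = """
<body>
  <a href="#">Skip</a>
  <a href="https://devpost.com">Home</a>
  <a href="/software">Software</a>
  <a href="https://devpost.com">Home again</a>
  <a name="no-href">No link</a>
</body>
"""

def unique_links(html, base_url):
    """Return absolute links in page order, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    seen = []
    for tag in soup.find_all("a", href=True):  # only 'a' tags that have href
        href = tag["href"]
        if href.startswith("#"):               # skip in-page anchors
            continue
        absolute = urljoin(base_url, href)     # resolve relative links
        if absolute not in seen:
            seen.append(absolute)
    return seen

links = unique_links(SAMPLE_HTML, BASE_URL)
for link in links:
    print(link)
```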

**HelloHappy**<br>
This example uses a very basic static website to avoid the issues we ran into with devpost and its dynamic, JS-rendered elements (I believe devpost uses vue.js). HelloHappy only displays different typefaces from Google with sample text, so it is primarily text-based.

Site to scrape [here](https://hellohappy.org/beautiful-web-type/)

In [99]:
#Compiling all typeface names
result = requests.get(
    "https://hellohappy.org/beautiful-web-type/")
soup = BeautifulSoup(result.content, 'lxml')
text_container = soup.find('section', {'id': 'container'})
#List comprehension to collect the id of every typeface section; each one is a <section> with class name "sample"
typeface_names = [sample.attrs['id'] for sample in text_container.find_all('section', class_='sample')]

for typeface in typeface_names:
    print(typeface)

title
herzog
bringhurst
nietzsche
tufte
seneca
thin
nabokov
postnormal
slogan
darwin
headline
camus


Getting the sample text is easy enough from a code perspective by invoking the `get_text` method, but what if we shift things to an abstracted user perspective?
Here, we'll prompt the user for a typeface and then bring up the sample text by searching for the section with that id. <br>

If we wanted to, we could do away with needing `typeface_names`, since a `find` query with no match returns `None` and a `find_all` query with no match returns an empty list, but since we went through the trouble of compiling it, let's check for valid typeface names that way!

In [108]:
query = input("What typeface sample text did you want to bring up?")
if query not in typeface_names:
    print("Sorry, I could not find that typeface in the collection!")
else:
    print("Here is the sample text for typeface " +
          f"{query}:\n{text_container.find('section', {'id' : query}).get_text()}")

What typeface sample text did you want to bring up? thin


Here is the sample text for typeface thin:



Unity
Rhythm
Balance
Emphasis
Proximity
Hierarchy





An admittedly simple, yet effective form of web scraping for a static webpage!
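
For comparison, here is a minimal sketch of the alternative mentioned earlier, where the lookup skips `typeface_names` and checks the result of `find` directly. It runs against a small inline HTML sample that only imitates the HelloHappy structure; `SAMPLE_HTML` and `sample_text` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Hypothetical sample imitating the container/sample layout of the page
SAMPLE_HTML = """
<section id="container">
  <section class="sample" id="thin"><h1>Unity</h1><h1>Rhythm</h1></section>
  <section class="sample" id="tufte"><p>Sample text</p></section>
</section>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
container = soup.find("section", {"id": "container"})

def sample_text(container, typeface):
    """Return the sample text for a typeface, or None if it does not exist."""
    section = container.find("section", id=typeface, class_="sample")
    if section is None:  # find returns None when nothing matches
        return None
    return section.get_text(separator="\n", strip=True)

found = sample_text(container, "thin")
missing = sample_text(container, "missing")
empty = container.find_all("section", id="missing")  # empty list, not None

print(found)
print(missing)
print(empty)
```

Checking for `None` matters because calling `get_text` on the result of a failed `find` would raise an `AttributeError`.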