# Web Scraping Part 2.0
### Parse HTML with Beautiful Soup

Part 2 builds on Part 1 and shows how to scrape data other than HTML tables.

This tutorial uses the following Python packages:

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/): parses HTML into a tree that can be navigated and searched.

Requests: sends an HTTP GET request to fetch the web page.

re: regular expression operations, used here to filter tags, strings, and CSS classes.

Pages used in this tutorial: 
[How Roller Coasters Work](https://science.howstuffworks.com/engineering/structural/roller-coaster8.htm) 

> Get Libraries

In [42]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

> Assign the URL to a variable and fetch the page. requests.get returns a Response object.

In [43]:
url = input('Enter URL: ')
html_scraped = requests.get(url)
type(html_scraped)

requests.models.Response
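Before parsing, it can help to confirm that the request actually succeeded. The sketch below wraps the fetch in a hypothetical helper, `fetch_html`. The helper name, the 10-second timeout, and the injectable `get` parameter are assumptions made for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the HTML text of `url`, raising on HTTP error responses.

    Hypothetical helper: the name, the timeout value, and the `get`
    parameter (handy for swapping in a stub) are illustrative assumptions.
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
    # so an error page is not handed to the parser by mistake.
    response.raise_for_status()
    return response.text
```

With a helper like this, `fetch_html(url)` could stand in for `requests.get(url).text` in the cells that follow.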

> Create a BeautifulSoup parse tree.

| HTML Parsers |
| ------------ |
| html.parser |
| html5lib |

In [44]:
soup = BeautifulSoup(html_scraped.text, 'html.parser')

> View Data: prettify() returns the parse tree as a formatted string, with each tag and text node on its own indented line.

In [None]:
preti = soup.prettify()
print(preti)
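The full page produces a very long string, so the effect of `prettify()` is easier to see on a short fragment. The snippet and the names `snippet`, `mini_soup`, and `pretty` below are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny stand-in document (an assumption, not taken from the scraped page).
snippet = "<p>Hello <b>world</b></p>"
mini_soup = BeautifulSoup(snippet, "html.parser")

pretty = mini_soup.prettify()  # a plain str with line breaks and indentation
print(pretty)                  # print() renders the line breaks instead of \n escapes
```

Echoing the string as the last line of a cell shows its repr, with `\n` escapes, which is why `print()` is used here.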

Beautiful Soup represents the parse tree with four kinds of Python objects:

- Tag
- NavigableString
- BeautifulSoup
- Comment
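A minimal sketch of where each object type shows up, using a small inline document invented for illustration (the variable names are assumptions as well):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical fragment containing a tag, some text, and an HTML comment.
html = "<p>Steel <b>coaster</b><!-- editor note --></p>"
doc = BeautifulSoup(html, "html.parser")

p_tag = doc.p                   # Tag: an element such as <p>
text_node = p_tag.contents[0]   # NavigableString: the text "Steel "
comment = p_tag.contents[2]     # Comment: a special kind of NavigableString

for obj in (doc, p_tag, text_node, comment):
    print(type(obj).__name__)
```

The `BeautifulSoup` object stands for the document as a whole and can be searched much like a `Tag`.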

> TAGS

Some common methods to navigate the BeautifulSoup parse tree based on tags.

| Approach | Example |
| -------- | ------- |
| Dot Operator | soup.p |
| String Filter | soup.find_all('p') |
| List Filter | soup.find_all(['p', 'link']) |
| Regular Expressions | soup.find_all(re.compile('^b')); also matches strings and CSS classes |
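The first three approaches can be compared side by side on a small document. The HTML below is a simplified stand-in written for this sketch, not the real page, and the variable names are assumptions.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup loosely modelled on the article.
html = """
<h2>Types</h2>
<p>Tracks are <b>wooden</b> or <b>steel</b>.</p>
<p>Trains can be <b>inverted</b>.</p>
"""
demo = BeautifulSoup(html, "html.parser")

first_p = demo.p                                 # dot operator: first <p> only
all_b = demo.find_all("b")                       # string filter: every <b>
headings_and_paras = demo.find_all(["h2", "p"])  # list filter: any listed name

print(first_p.get_text())
print([b.get_text() for b in all_b])
print([tag.name for tag in headings_and_paras])
print(demo.table)  # dot operator returns None when no such tag exists
```

The dot operator is convenient for a quick look, while `find_all` is the better fit when every match is needed.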

dot operator -> bs4.element.Tag (first match only)

In [46]:
soup.h3

<h3 class="text-xl font-bold" id="modal-headline">
		Cite This!
	</h3>

string filter -> list of every matching tag

In [54]:
soup.find_all('b')


[<b>wooden</b>,
 <b>steel</b>,
 <b>suspended</b>,
 <b>inverted</b>,
 <b>Sit-down:</b>,
 <b>Stand-up:</b>,
 <b>Inverted:</b>,
 <b>Suspended:</b>,
 <b>Pipeline:</b>,
 <b>Bobsled:</b>,
 <b>Flying:</b>,
 <b>Wing:</b>]

list filter -> list of tags matching any name in the list

In [55]:
soup.find_all(['h2', 'p'])

[<p class="ad-disclaimer clear-both text-xs text-center mb-1 hidden md:block">Advertisement</p>,
 <p class="mb-4">
 					By: <a class="text-primary" data-track-gtm="Byline" href="https://www.howstuffworks.com/about-author.htm#tom harris">Tom Harris</a> &amp; <a class="text-primary" data-track-gtm="Byline" href="https://www.howstuffworks.com/about-cherise-threewitt.htm#lapine">Cherise Threewitt</a>
 </p>,
 <h2 class="text-2xl mb-8">Types of Roller Coasters</h2>,
 <p>There are two major types of <a href="https://adventure.howstuffworks.com/destinations/theme-parks/12-of-the-worlds-greatest-roller-coasters.htm">roller coasters</a>, distinguished mainly by their track structure. The tracks of <b>wooden</b> roller coasters are similar to traditional railroad tracks. In most coasters, the car wheels have the same flanged design as the wheels of a train; the inner part of the wheel has a wide lip that keeps the car from rolling off the side of the track. The car also has another set of wheels

other filters: .parent returns the element that contains each tag

In [65]:
btag = soup.find_all('b')
bpara = [b.parent for b in btag]
bpara

[<p>There are two major types of <a href="https://adventure.howstuffworks.com/destinations/theme-parks/12-of-the-worlds-greatest-roller-coasters.htm">roller coasters</a>, distinguished mainly by their track structure. The tracks of <b>wooden</b> roller coasters are similar to traditional railroad tracks. In most coasters, the car wheels have the same flanged design as the wheels of a train; the inner part of the wheel has a wide lip that keeps the car from rolling off the side of the track. The car also has another set of wheels (or sometimes just a safety bar) that runs underneath the track. This keeps the cars from flying up into the air.</p>,
 <p>The range of motion is greatly expanded in <b>steel</b> roller coasters. The world of roller coasters changed radically with the introduction of tubular steel tracks in the 1950s. As the name suggests, these tracks consist of a pair of long steel tubes. These tubes are supported by a sturdy, lightweight superstructure made of slightly large
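When a paragraph contains several `<b>` tags, collecting `.parent` for each one returns that paragraph more than once. A possible way to keep each parent a single time and reduce it to plain text is sketched below; the inline HTML and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup: the first paragraph holds two <b> tags.
html = """
<p>Tracks are <b>wooden</b> or <b>steel</b>.</p>
<p>Trains can be <b>inverted</b>.</p>
"""
demo = BeautifulSoup(html, "html.parser")

unique_parents = []
for b in demo.find_all("b"):
    parent = b.parent
    # Compare by identity: Tag equality compares markup, not tree position.
    if not any(parent is seen for seen in unique_parents):
        unique_parents.append(parent)

texts = [p.get_text() for p in unique_parents]
print(texts)
```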

## Web Scraping with BeautifulSoup Part 2.1

Scraping and working with text data:

- parse tree and string (text) format
- filtering the text for strings and CSS classes with regex
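As a rough sketch of the second point, `find_all` accepts a compiled pattern for its `string` and `class_` arguments. The markup below is a shortened stand-in for the page, and the variable names are assumptions.

```python
import re
from bs4 import BeautifulSoup

# Shortened, made-up markup in the style of the scraped page.
html = """
<p class="mb-4">By: Tom Harris</p>
<p class="text-xs">Advertisement</p>
<h2 class="text-2xl mb-8">Types of Roller Coasters</h2>
"""
demo = BeautifulSoup(html, "html.parser")

# string=: matches text nodes; the result holds NavigableString objects, not tags.
coaster_strings = demo.find_all(string=re.compile("Coaster"))

# class_=: the pattern is tested against each CSS class of a tag.
text_classed = demo.find_all(class_=re.compile("^text-"))

print(coaster_strings)
print([tag.name for tag in text_classed])
```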

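Passing a compiled regular expression to find_all matches it against tag names using search(), so re.compile('b') finds every tag whose name contains the letter b. Anchoring the pattern with ^ restricts the match to tag names that begin with that letter, such as body and b.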
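The difference between the two patterns is easiest to see on a document with only a few tags. The markup below is invented for this sketch.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup with tag names that contain or start with "b".
html = "<html><body><table><tr><td><b>steel</b></td></tr></table></body></html>"
demo = BeautifulSoup(html, "html.parser")

# Unanchored: any tag name containing "b" (body, table, b).
contains_b = [tag.name for tag in demo.find_all(re.compile("b"))]

# Anchored with ^: only tag names that begin with "b" (body, b).
starts_with_b = [tag.name for tag in demo.find_all(re.compile("^b"))]

print(contains_b)
print(starts_with_b)
```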

In [71]:
soup.find_all(re.compile('^b'))

[<body class="hsw-page theme-hsw pt-14 md:pt-0 leaderboard-sticky science cid-10802 interior editorial-content editorial article-template paginated-template page-8">
 <noscript><iframe height="0" src="https://www.googletagmanager.com/ns.html?id=GTM-NXHP8V" style="display:none;visibility:hidden" width="0"></iframe></noscript>
 <div class="fixed top-0 w-full md:relative z-1050 print:hidden" id="header" x-data="{ showMobileMenu : false, showMobileSearch : false, showNewsletterBanner : false, showNewsletterChatbox : false }">
 <header class="bg-royal-blue text-white max-w-full h-14 md:h-20" data-track-gtm="Header">
 <div class="w-full md:w-11/12 w-max-1600 mx-auto px-4 md:px-2">
 <div class="flex justify-between items-center h-14 md:h-20">
 <div class="shrink md:hidden self-center">
 <a :aria-expanded="showMobileMenu.toString()" @click.prevent="showMobileMenu = !showMobileMenu" aria-controls="mobile-nav" class="text-white hover:text-white focus:text-green" data-cy="mobile-menu-icon" href="