## Introduction 

In data science, we can do a lot of exciting work with the right dataset. Once we have interesting data, we can use pandas or Matplotlib to analyze or visualize trends. But how do we get that data in the first place?

If it’s provided to us in a well-organized CSV or JSON file, we’re lucky! Most of the time, we need to go out and search for it ourselves.

Often you’ll find the perfect website that has all the data you need, but there’s no way to download it. This is where BeautifulSoup comes in handy: if we find the data we want to analyze online, we can use it to scrape the HTML and turn the data into a structure we can work with. This Python library allows us to quickly take information from a website and put it into a DataFrame.

## Rules of Scraping

When we scrape websites, we need to follow some guidelines so that we treat the websites and their owners with respect.

Always check a website’s Terms and Conditions before scraping. Read the statement on the legal use of data. Usually, the data you scrape should not be used for commercial purposes.

Do not spam the website with a ton of requests. A large number of requests can break a website that is unprepared for that level of traffic. As a general rule of good practice, make no more than one request per second.
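One way to respect the pacing rule above is to pause between consecutive requests. The sketch below is a minimal illustration, not a complete crawler: `fetch_politely`, its `fetch` argument, and `delay_seconds` are hypothetical names introduced here, and the one-second default simply mirrors the guideline above.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

Any callable that takes a URL can be passed as `fetch`, for example `requests.get`, so the pacing logic stays separate from the code that makes the request.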

If the layout of the website changes, you will have to change your scraping code to follow the new structure of the site.

## Requests

To get the HTML of a website, we need to make a request for the content of the webpage.

The requests library for Python makes getting content really easy. All we have to do is import the library, and then feed in the URL we want to GET:

import requests

webpage = requests.get('https://www.codecademy.com/articles/http-requests')

print(webpage.text)

This code will print out the HTML of the page.

![r1](https://i.imgur.com/CDkzv5A.jpg)
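A request can fail, for example when the page does not exist or the server is slow to respond. The sketch below wraps the same `requests.get` call in a small helper that sets a timeout and raises an exception on an HTTP error status. The name `get_html` and the ten-second default are assumptions made for illustration.

```python
import requests


def get_html(url, timeout_seconds=10):
    """Return the HTML text of url, raising an exception on an HTTP error."""
    response = requests.get(url, timeout=timeout_seconds)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

Calling `get_html('https://www.codecademy.com/articles/http-requests')` should return the same HTML string that `webpage.text` holds in the snippet above, provided the request succeeds.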

## The BeautifulSoup Object

When we printed out all of that HTML from our request, it seemed pretty long and messy. How could we pull out the relevant information from that long string?

BeautifulSoup is a Python library that makes it easy for us to traverse an HTML page and pull out the parts we’re interested in. We can import it by using the line:

from bs4 import BeautifulSoup

Then, all we have to do is convert the HTML document to a BeautifulSoup object!

If this is our HTML file, rainbow.html:

![bs1](https://i.imgur.com/nHExCMm.jpg)

"html.parser" is one option for parsers we could use. There are other options, like "lxml" and "html5lib" that have different advantages and disadvantages, but for our purposes we will be using "html.parser" throughout.

With the requests skills we just learned, we can use a website hosted online as that HTML:

webpage = requests.get("http://rainbow.com/rainbow.html")

soup = BeautifulSoup(webpage.content, "html.parser")
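To experiment without making a network request, the BeautifulSoup constructor also accepts a plain string of HTML, with the parser name passed as the second argument. The HTML below is a made-up stand-in, not the actual contents of rainbow.html.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html_doc = """
<html>
  <body>
    <h1>Rainbow</h1>
    <p class="color">red</p>
    <p class="color">violet</p>
  </body>
</html>
"""

# The parser name is an argument of BeautifulSoup, not of requests.get.
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.h1)  # <h1>Rainbow</h1>
```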

When we use BeautifulSoup in combination with pandas, we can turn websites into DataFrames that are easy to manipulate and gain insights from.

![bs2](https://i.imgur.com/NZ6mEDk.jpg)
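As a rough sketch of the path from HTML to a DataFrame, the example below parses a small, made-up table and builds one row per `<tr>` element. The table contents and column names are invented for illustration, and real pages usually need more cleanup than this.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table used only for illustration.
html_doc = """
<table>
  <tr><th>color</th><th>position</th></tr>
  <tr><td>red</td><td>1</td></tr>
  <tr><td>orange</td><td>2</td></tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
rows = soup.find_all("tr")

# The first row holds the column names; the rest hold the data.
headers = [cell.get_text() for cell in rows[0].find_all("th")]
records = [[cell.get_text() for cell in row.find_all("td")] for row in rows[1:]]

df = pd.DataFrame(records, columns=headers)
print(df)
```

Every value scraped this way is a string, so numeric columns usually need a conversion step such as `pd.to_numeric` before analysis.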

## Object Types

BeautifulSoup breaks the HTML page into several types of objects.

### Tags

![obt2](https://i.imgur.com/QqhRZ5f.jpg)

div

{'id': 'example'}
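The tag name and attribute dictionary printed above can be reproduced with a one-line document. This is a minimal sketch that assumes the HTML is a single `div` with an `id` of `example`.

```python
from bs4 import BeautifulSoup

# Assumed HTML: a single div with one attribute.
soup = BeautifulSoup('<div id="example">An example div</div>', "html.parser")

tag = soup.div      # the first <div> tag in the document
print(tag.name)     # div
print(tag.attrs)    # {'id': 'example'}
print(tag["id"])    # example
```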

### NavigableStrings

NavigableStrings are the pieces of text that are in the HTML tags on the page. You can get the string inside of the tag by calling .string:

print(soup.div.string)

An example div

![obt](https://i.imgur.com/HwewhtK.jpg)
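One detail worth knowing is that `.string` only returns text when a tag has a single child. The sketch below, using made-up HTML, shows both cases.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup('<div id="example">An example div</div>', "html.parser")
text = soup.div.string
print(text)                               # An example div
print(isinstance(text, NavigableString))  # True

# When a tag has more than one child, .string is None.
nested = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")
print(nested.div.string)                  # None
```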

## Navigating by Tags

![tg1](https://i.imgur.com/72kCwOi.jpg)

![tg2](https://i.imgur.com/XaBxTuw.jpg)

![tg3](https://i.imgur.com/6mJmbGv.jpg)

![tg4](https://i.imgur.com/SzD7N41.jpg)
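As a small sketch of navigating by tag name, the example below reaches the first matching tag as an attribute of the soup, then walks to its children and its parent. The HTML is invented for illustration and may differ from the page used in this lesson.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, written without whitespace between tags so that
# .children yields only tags and no whitespace strings.
html_doc = "<div><h1>Rainbow</h1><p>red</p><p>violet</p></div>"
soup = BeautifulSoup(html_doc, "html.parser")

first_paragraph = soup.p  # only the first <p> tag
child_names = [child.name for child in soup.div.children]
parent_name = soup.p.parent.name

print(first_paragraph)  # <p>red</p>
print(child_names)      # ['h1', 'p', 'p']
print(parent_name)      # div
```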

## Website Structure

When we’re telling our Python script what HTML tags to grab, we need to know the structure of the website and what we’re looking for.

When you’re preparing to scrape a website, first inspect the HTML to see where the info you are looking for is located on the page.
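A browser’s developer tools are the usual way to inspect a page, but the structure can also be viewed from Python. As one possible approach, `.prettify()` returns the parsed document as an indented string with one tag per line. The HTML here is a made-up example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html_doc = '<div id="banner"><h1>Rainbow</h1><p class="color">red</p></div>'
soup = BeautifulSoup(html_doc, "html.parser")

# Print the document with one tag per line and nested indentation.
pretty = soup.prettify()
print(pretty)
```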

## Find All

If we want to find all of the occurrences of a tag, instead of just the first one, we can use .find_all().

This method can take just the name of a tag, and it returns a list of all occurrences of that tag.

![fa1](https://i.imgur.com/cvpIE3r.jpg)

![fa2](https://i.imgur.com/k5diJik.jpg)

![fa3](https://i.imgur.com/gclWjSr.jpg)
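Besides a tag name, `.find_all()` also accepts a compiled regular expression, a list of names, or an `attrs` dictionary. The sketch below tries each form on made-up HTML; the variable names are illustrative only.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html_doc = """
<h1>Rainbow</h1>
<h2>Warm colors</h2>
<p class="color">red</p>
<p class="color">orange</p>
<h2>Cool colors</h2>
<p class="color" id="last">violet</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

paragraphs = soup.find_all("p")                   # by tag name
headings = soup.find_all(re.compile("^h[1-6]$"))  # by regular expression
mixed = soup.find_all(["h1", "p"])                # by a list of names
last = soup.find_all(attrs={"id": "last"})        # by attribute value

print(len(paragraphs))  # 3
print(len(headings))    # 3
print(len(mixed))       # 4
print(last)             # [<p class="color" id="last">violet</p>]
```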

### Using A Function

If our selection starts to get really complicated, we can separate out all of the logic that we’re using to choose a tag into its own function. Then, we can pass that function into .find_all()!

![fa4](https://i.imgur.com/8gMJUzw.jpg)

![fa5](https://i.imgur.com/67DbwjX.jpg)
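A function passed to `.find_all()` receives each tag in turn and should return `True` for the tags to keep. The sketch below uses a hypothetical filter, `is_unmarked_color`, on made-up HTML to select `p` tags that have the class `color` but no `id`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html_doc = """
<p class="color">red</p>
<p class="color" id="favorite">green</p>
<p>not a color</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def is_unmarked_color(tag):
    """Return True for <p> tags with class 'color' and no id attribute."""
    # tag.get("class") is a list of class names, or None if absent.
    classes = tag.get("class") or []
    return tag.name == "p" and "color" in classes and not tag.has_attr("id")


matches = soup.find_all(is_unmarked_color)
print(matches)  # [<p class="color">red</p>]
```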

## Select for CSS Selectors

![cs1](https://i.imgur.com/Y0uvenZ.jpg)

![cs2](https://i.imgur.com/Fu2Ds5Z.jpg)

![cs3](https://i.imgur.com/K9RAs23.jpg)

![cs4](https://i.imgur.com/bn0DhIs.jpg)
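As a brief sketch of CSS selectors in BeautifulSoup, `.select()` returns a list of every tag that matches a selector, and `.select_one()` returns only the first match. The HTML and class names below are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html_doc = """
<div class="warm">
  <p class="color">red</p>
  <p class="color">orange</p>
</div>
<div class="cool">
  <p class="color" id="last">violet</p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

all_colors = soup.select(".color")         # every tag with class "color"
warm_colors = soup.select("div.warm > p")  # <p> children of div.warm
last_color = soup.select_one("#last")      # first tag with id "last"

print(len(all_colors))                       # 3
print([p.get_text() for p in warm_colors])   # ['red', 'orange']
print(last_color.get_text())                 # violet
```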

## Reading Text

![rt3](https://i.imgur.com/CtPebCe.jpg)

![rt4](https://i.imgur.com/A6MRMZi.jpg)

![rt1](https://i.imgur.com/mLE2NXe.jpg)

![rt2](https://i.imgur.com/DEIO56L.jpg)
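When a tag contains other tags, `.string` is `None`, while `.get_text()` joins all of the text inside the tag. The sketch below shows this on a made-up heading, including the optional separator argument.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html_doc = "<h1>Colors of the <span>rainbow</span></h1>"
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.h1.string)         # None, because <h1> has two children
print(soup.h1.get_text())     # Colors of the rainbow
print(soup.h1.get_text("|"))  # Colors of the |rainbow
```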

## Review

![re1](https://i.imgur.com/rCKZxuq.jpg)

![re2](https://i.imgur.com/oc4QafY.jpg)

## Quiz

![q1](https://i.imgur.com/ICBUbI2.jpg)

![q2](https://i.imgur.com/qjgnHmy.jpg)

![q3](https://i.imgur.com/PVUplgV.jpg)

![q4](https://i.imgur.com/f6l3DUY.jpg)

![q5](https://i.imgur.com/pH1aZml.jpg)

![q6](https://i.imgur.com/FJI92fP.jpg)