In [1]:
#pip install jupyter-black
import jupyter_black
jupyter_black.load()

### Requests

In order to get the HTML of the website, we need to make a request to get the content of the webpage. To learn more about requests in a general sense, you can check out this article.

Python has a requests library that makes getting content really easy. All we have to do is import the library, and then feed in the URL we want to GET:
```
import requests

webpage = requests.get('https://www.codecademy.com/articles/http-requests')
print(webpage.text)
```

`requests.get(address)` returns a response object.
In web scraping we mostly use its `.content` or `.text` attribute.
There is an important difference between them:

- Use `r.text` for textual responses (like HTML or XML); it is the body decoded to a string.
- Use `r.content` for binary file types (such as images or PDFs) or when you need the raw byte stream.
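To make the bytes-versus-text distinction concrete without touching the network, here is a minimal sketch using plain Python strings. The sample markup is made up for illustration; `raw_bytes` stands in for what `.content` would hold and `decoded` for what `.text` aims to give you.

```python
# Bytes as they might arrive over the network (the kind of value .content holds).
raw_bytes = "<p>Café</p>".encode("utf-8")

# Decoding with the correct encoding gives readable text (what .text aims for).
decoded = raw_bytes.decode("utf-8")

# Decoding with a wrong encoding guess garbles the non-ASCII character.
garbled = raw_bytes.decode("latin-1")

print(type(raw_bytes).__name__, type(decoded).__name__)
print(decoded)
print(garbled)
```

This is why checking `page_response.encoding` can be worthwhile when scraped text looks garbled.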


In [None]:
import requests

address = "https://content.codecademy.com/courses/beautifulsoup/shellter.html"

page_response = requests.get(address)
#print(page_response)
webpage = page_response.content
#print(page_response.text)
#print(page_response.content)
#print(page_response.headers)
#print(page_response.encoding)

### The BeautifulSoup Object

BeautifulSoup is a Python library that makes it easy for us to traverse an HTML page and pull out the parts we’re interested in. We can import it by using the line:

`from bs4 import BeautifulSoup`

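Then, all we have to do is convert the HTML document to a BeautifulSoup object! The first argument is the markup itself (a string of HTML or an open file handle), not a file name:

`soup = BeautifulSoup(open("rainbow.html"), "html.parser")`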

"html.parser" is one option for parsers we could use. 
There are other options: "lxml" , "html5lib" 

```
import requests
from bs4 import BeautifulSoup

webpage = requests.get("http://rainbow.com/rainbow.html")
soup = BeautifulSoup(webpage.content, "html.parser")
```
When we use BeautifulSoup in combination with pandas, we can turn websites into DataFrames that are easy to manipulate and gain insights from.
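The same conversion can be tried offline by handing BeautifulSoup a string of markup directly. The HTML below is a made-up example, not the real rainbow page.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document used only for illustration.
html_doc = """
<html><body>
<h1>Rainbow</h1>
<p class="color">Red</p>
<p class="color">Violet</p>
</body></html>
"""

# Parse the markup string with the parser that ships with Python.
soup = BeautifulSoup(html_doc, "html.parser")

heading = soup.h1.string
paragraph_count = len(soup.find_all("p"))
print(heading, paragraph_count)
```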



### Object types in BS
---

#### Tags

A tag object corresponds to an HTML tag (p, div, h1, ul, li, a, thead, tbody, tfoot, tr, th, td, etc.).
We can access a tag by using its name as an attribute of the soup object (`soup.div`).
This call returns the first such tag on the page.
The information within this object can be reached through its attributes:

- name of the tag: **soup.div.name**
- attributes of the tag, as a dictionary: **soup.div.attrs**
- the actual text within the tag: **soup.div.string**
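A quick sketch of those three attributes on a made-up `div`. Note that `class` can hold several values, so BeautifulSoup returns it as a list.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
soup = BeautifulSoup(
    '<div id="banner" class="wide dark">Hello world</div>', "html.parser"
)

tag = soup.div           # first <div> on the page
tag_name = tag.name      # the tag's name
tag_attrs = tag.attrs    # dictionary of attributes
tag_id = tag["id"]       # a tag can also be indexed like a dictionary
tag_text = tag.string    # the text inside the tag

print(tag_name, tag_attrs, tag_id, tag_text)
```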


#### Navigating by tags

You can access the direct children of a tag: `soup.ul.children`
or walk up through all the ancestors of a tag: `soup.li.parents`
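Both attributes are iterators, so they are usually consumed in a loop or a list comprehension. A small sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
soup = BeautifulSoup("<ul><li>Red</li><li>Blue</li></ul>", "html.parser")

# Direct children of the first <ul>.
child_texts = [child.get_text() for child in soup.ul.children]

# Every ancestor of the first <li>, from the closest one outwards.
parent_names = [parent.name for parent in soup.li.parents]

print(child_texts)
print(parent_names)
```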


#### Find

- .find() - returns the first matching tag, or `None` if there is no such tag
- .find_all() - returns a list of all matching tags, or an empty list
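The difference in return values matters when a tag is missing, as this sketch on made-up markup shows:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

first_p = soup.find("p")               # first matching tag
missing = soup.find("table")           # None, because there is no <table>
all_p = soup.find_all("p")             # list-like result with both <p> tags
no_tables = soup.find_all("table")     # empty result, safe to loop over

print(first_p.string, missing, len(all_p), len(no_tables))
```

Because `.find()` can return `None`, it is worth checking the result before accessing attributes on it.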

#### Using Regex

What if we want every <ol> and every <ul> that the page contains? We can use the .compile() function from the re module with the regex "[ou]l", which means “match either o or u, followed by l”. BeautifulSoup matches the pattern anywhere in the tag name, so the unanchored pattern would also match names such as colgroup; anchoring it with ^ and $ restricts it to exactly ol and ul.

We can select both of these types of elements with a regex in our .find_all():
```
import re
soup.find_all(re.compile("^[ou]l$"))
```
What if we want all of the heading tags that the page contains? Regex to the rescue again! HTML defines h1 through h6, so the expression "^h[1-6]$" means h followed by a single digit between 1 and 6.
```
import re
soup.find_all(re.compile("^h[1-6]$"))
```
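A compiled regex passed to `.find_all()` is matched anywhere in the tag name, so an unanchored pattern can pick up more than intended. The sketch below, on made-up markup, compares a loose and an anchored pattern.

```python
import re

from bs4 import BeautifulSoup

# Made-up markup containing a tag whose name merely contains "ol".
html = (
    "<table><colgroup></colgroup></table>"
    "<ol><li>a</li></ol><ul><li>b</li></ul>"
)
soup = BeautifulSoup(html, "html.parser")

# Unanchored: matches any tag name that contains "ol" or "ul".
loose = [tag.name for tag in soup.find_all(re.compile("[ou]l"))]

# Anchored: matches only tag names that are exactly "ol" or "ul".
strict = [tag.name for tag in soup.find_all(re.compile("^[ou]l$"))]

print(loose)
print(strict)
```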

We can pass a list to find_all instead of a single tag name:
`soup.find_all(['h1', 'a', 'p'])`

find_all has an attrs parameter that takes a dictionary of attribute names and values:
`soup.find_all(attrs={'class':'banner'})`
`soup.find_all(attrs={'class':'banner', 'id':'jumbotron'})`
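Here is a short sketch of the list and attrs forms on made-up markup, to show what each call returns:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = (
    "<h1>Title</h1>"
    '<div class="banner" id="jumbotron">A</div>'
    '<div class="banner">B</div>'
    "<p>text</p>"
)
soup = BeautifulSoup(html, "html.parser")

# A list matches any of the given tag names, in document order.
names = [tag.name for tag in soup.find_all(["h1", "p"])]

# attrs narrows the search by attribute values; all entries must match.
banners = soup.find_all(attrs={"class": "banner"})
jumbotron = soup.find_all(attrs={"class": "banner", "id": "jumbotron"})

print(names, len(banners), len(jumbotron))
```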

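We can even pass a function into find_all if the selection gets really complicated. The function receives each tag and returns True for the ones we want:
```
def has_banner_class_and_hello_world(tag):
    # class is multi-valued, so tag.get("class") returns a list (or the default)
    return "banner" in tag.get("class", []) and tag.string == "Hello world"

soup.find_all(has_banner_class_and_hello_world)
```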
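To see a function filter in action, the sketch below runs one against made-up markup in which only one tag satisfies both conditions. It reads the class with `tag.get("class", [])`, which returns a list of class names.

```python
from bs4 import BeautifulSoup


def has_banner_class_and_hello_world(tag):
    # tag.get returns the list of classes, or [] if the tag has no class.
    return "banner" in tag.get("class", []) and tag.string == "Hello world"


# Made-up markup: only the first <div> meets both conditions.
html = (
    '<div class="banner">Hello world</div>'
    '<div class="banner">Bye</div>'
    "<p>Hello world</p>"
)
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all(has_banner_class_and_hello_world)
print(matches)
```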


#### CSS Selectors .select()


Another way to capture your desired elements with the soup object is to use CSS selectors. The .select() method will take in all of the CSS selectors you normally use in a .css file!
```
<h1 class='results'>Search Results for: <span class='searchTerm'>Funfetti</span></h1>
<div class='recipeLink'><a href="spaghetti.html">Funfetti Spaghetti</a></div>
<div class='recipeLink' id="selected"><a href="lasagna.html">Lasagna de Funfetti</a></div>
<div class='recipeLink'><a href="cupcakes.html">Funfetti Cupcakes</a></div>
<div class='recipeLink'><a href="pie.html">Pecan Funfetti Pie</a></div>
```
If we wanted to select all of the elements that have the class 'recipeLink', we could use the command:

`soup.select(".recipeLink")`

If we wanted to select the element that has the id 'selected', we could use the command:

`soup.select("#selected")`
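Both selectors can be tried on a shortened copy of the search-results markup above. `.select()` always returns a list, even for an id; `.select_one()` returns just the first match.

```python
from bs4 import BeautifulSoup

# Shortened copy of the search-results markup.
html = """
<h1 class='results'>Search Results for: <span class='searchTerm'>Funfetti</span></h1>
<div class='recipeLink'><a href="spaghetti.html">Funfetti Spaghetti</a></div>
<div class='recipeLink' id="selected"><a href="lasagna.html">Lasagna de Funfetti</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

recipe_divs = soup.select(".recipeLink")   # all elements with that class
selected = soup.select("#selected")        # a list with one element
selected_one = soup.select_one("#selected")  # the element itself

recipe_names = [div.a.string for div in recipe_divs]
print(recipe_names)
print(selected_one.a["href"])
```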

Let’s say we wanted to loop through all of the links to these funfetti recipes that we found from our search.
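```
from urllib.parse import urljoin

for link in soup.select(".recipeLink > a"):
    # link is a Tag; base_url (assumed) is the address of the search results page
    webpage = requests.get(urljoin(base_url, link["href"]))
    new_soup = BeautifulSoup(webpage.content, "html.parser")
```
This loop will go through each link in each .recipeLink div and create a soup object out of the webpage it links to. Because each href is relative (spaghetti.html), it has to be joined with the address of the current page before it can be requested. So, it would first make soup out of the page behind <a href="spaghetti.html">Funfetti Spaghetti</a>, then <a href="lasagna.html">Lasagna de Funfetti</a>, and so on.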
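The step of turning each `<a>` tag into a URL that can be requested can be checked without making any requests. In this sketch `base_url` is a hypothetical address chosen for illustration; the recipe pages are not real.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical address of the search results page (an assumption).
base_url = "https://example.com/search/"

html = """
<div class='recipeLink'><a href="spaghetti.html">Funfetti Spaghetti</a></div>
<div class='recipeLink'><a href="cupcakes.html">Funfetti Cupcakes</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Read the href of each tag and resolve it against the page address.
urls = [urljoin(base_url, a["href"]) for a in soup.select(".recipeLink > a")]
print(urls)
```

Each entry in `urls` is a string that could be passed to `requests.get()`.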


In [2]:
import requests
from bs4 import BeautifulSoup

prefix = "https://content.codecademy.com/courses/beautifulsoup/"
webpage_response = requests.get(
    "https://content.codecademy.com/courses/beautifulsoup/shellter.html"
)

webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")

turtle_links = soup.find_all("a")
links = []
# go through all of the a tags and get the links associated with them.
# additionally extract the href from the tag and append it to the page prefix to build a link to follow:
for a in turtle_links:
    links.append(prefix + a["href"])

# Define turtle_data to store the data we gather when we follow the links:
turtle_data = dict()
# follow each link and create a new soup object for each
for link in links:
    webpage = requests.get(link)
    turtle = BeautifulSoup(webpage.content, "html.parser")
    # select() returns a list, so we take index 0 and use .text to get the actual string
    turtle_name = turtle.select("p.name")[0].text
    turtle_data[turtle_name] = []

### Reading text

To get the text from a tag we can use the .get_text() method.
It joins all the strings within the tag into a single string.
If we want to keep the strings separated we can specify a separator: `.get_text('|')`
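A small sketch on made-up markup shows how `.get_text()` differs from `.string` when a tag contains nested tags:

```python
from bs4 import BeautifulSoup

# Made-up markup: the <h1> contains its own text plus a nested <span>.
soup = BeautifulSoup(
    "<h1>Search Results for: <span>Funfetti</span></h1>", "html.parser"
)

joined = soup.h1.get_text()        # all strings joined together
separated = soup.h1.get_text("|")  # strings joined with a separator
direct = soup.h1.string            # None, because <h1> has more than one child

print(joined)
print(separated)
print(direct)
```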




In [None]:
import requests
from bs4 import BeautifulSoup

prefix = "https://content.codecademy.com/courses/beautifulsoup/"
webpage_response = requests.get(
    "https://content.codecademy.com/courses/beautifulsoup/shellter.html"
)

webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")

turtle_links = soup.find_all("a")
links = []
# go through all of the a tags and get the links associated with them:
for a in turtle_links:
    links.append(prefix + a["href"])

# Define turtle_data:
turtle_data = {}

# follow each link and create a new soup for each
for link in links:
    webpage = requests.get(link)
    turtle = BeautifulSoup(webpage.content, "html.parser")
    # parse the name of the turtle based on the class
    turtle_name = turtle.select(".name")[0].get_text()
    # extract the turtle details from the li tags
    turtle_details = turtle.find_all("li")
    turtle_details_text = []
    # as the find_all() return object is a list we need to loop through and extract the text from each element
    for detail in turtle_details:
        turtle_details_text.append(detail.get_text())
    turtle_data[turtle_name] = turtle_details_text
print(turtle_data)
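Each detail string collected above has the form `LABEL: value`, so one possible cleaning step is to split it into a label and a value. The sketch below works on a hard-coded sample with two of the turtles rather than the live page; `parse_details` is a hypothetical helper name.

```python
# Hard-coded sample in the same shape as turtle_data.
sample = {
    "Aesop": [
        "AGE: 7 Years Old",
        "WEIGHT: 6 lbs",
        "SEX: Female",
        "BREED: African Aquatic Sideneck Turtle",
    ],
    "Caesar": [
        "AGE: 2 Years Old",
        "WEIGHT: 4 lbs",
        "SEX: Male",
        "BREED: Greek Tortoise",
    ],
}


def parse_details(details):
    """Turn ['AGE: 7 Years Old', ...] into {'age': '7 Years Old', ...}."""
    parsed = {}
    for item in details:
        # Split on the first colon only, so values may contain colons.
        label, _, value = item.partition(":")
        parsed[label.strip().lower()] = value.strip()
    return parsed


clean = {name: parse_details(details) for name, details in sample.items()}
print(clean["Aesop"])
```

A dictionary of dictionaries like `clean` can be passed to `pd.DataFrame.from_dict(clean, orient="index")` to get named columns instead of 0, 1, 2, 3.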

In [5]:
# we can create a dataframe from a dict to store this turtle data
# it will need cleaning and transformation though

import pandas as pd

turtle_df = pd.DataFrame.from_dict(turtle_data, orient="index")
print(turtle_df)

                            0                1            2  \
Aesop        AGE: 7 Years Old    WEIGHT: 6 lbs  SEX: Female   
Caesar       AGE: 2 Years Old    WEIGHT: 4 lbs    SEX: Male   
Sulla         AGE: 1 Year Old     WEIGHT: 1 lb    SEX: Male   
Spyro        AGE: 6 Years Old    WEIGHT: 3 lbs  SEX: Female   
Zelda        AGE: 3 Years Old    WEIGHT: 2 lbs  SEX: Female   
Bandicoot    AGE: 2 Years Old    WEIGHT: 2 lbs    SEX: Male   
Hal           AGE: 1 Year Old  WEIGHT: 1.5 lbs  SEX: Female   
Mock        AGE: 10 Years Old   WEIGHT: 10 lbs    SEX: Male   
Sparrow    AGE: 1.5 Years Old  WEIGHT: 4.5 lbs  SEX: Female   

                                                3  \
Aesop      BREED: African Aquatic Sideneck Turtle   
Caesar                      BREED: Greek Tortoise   
Sulla      BREED: African Aquatic Sideneck Turtle   
Spyro                       BREED: Greek Tortoise   
Zelda                   BREED: Eastern Box Turtle   
Bandicoot  BREED: African Aquatic Sideneck Turtle  