### Discussion Week 3

We will review an example that highlights the need to be proficient in xpath syntax, because we cannot always rely on devtools to inspect the html. 

Consider the website [`https://imsdb.com/`](https://imsdb.com/). We want to scrape the links in the _Genre_ sidebar. Using devtools, we can inspect this element and find that it's a child of `table/tbody`. 

In [None]:
import requests
import lxml.html as lx

In [None]:
result = requests.get('https://imsdb.com/')
result.raise_for_status()  # call the method; without parentheses nothing is checked
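A small offline sketch of why the parentheses matter: `raise_for_status` is a method, so merely naming it does nothing. The response object below is built by hand (an assumption for illustration, so that no network access is needed) and is not the real response from the site.

```python
import requests

# Build a response by hand so the example runs without network access.
fake = requests.Response()
fake.status_code = 404

# Naming the method without calling it only returns the bound method;
# no check is performed and nothing is raised.
unused = fake.raise_for_status

# Calling it performs the check and raises for 4xx/5xx status codes.
try:
    fake.raise_for_status()
    raised = False
except requests.HTTPError:
    raised = True

print(raised)
```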

In [None]:
html = lx.fromstring(result.text)

In [None]:
html.xpath('//table/tbody')

The list is empty. Copying the xpath from devtools doesn't help either. Apparently, the html that `requests` returns is not the same as the one rendered by Google Chrome. We can inspect what is being returned by checking the _Network_ tab. 
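One general reason why a path copied from devtools can fail, offered here as an assumption and not verified for this site: browsers insert a `<tbody>` element into every table when they build the page, even if the raw html has none, while `lxml` keeps the source as it is. A minimal sketch on a made-up snippet:

```python
import lxml.html as lx

# Hypothetical page source: the table has no <tbody>, as is common in raw html.
source = """
<html><body>
<table>
  <tr><td>Genre</td></tr>
  <tr><td><a href="/genre/Action">Action</a></td></tr>
</table>
</body></html>
"""

doc = lx.fromstring(source)

with_tbody = doc.xpath('//table/tbody')  # path as a browser would show it
without_tbody = doc.xpath('//table/tr')  # path matching the raw source

print(len(with_tbody), len(without_tbody))
```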

While reloading the webpage to monitor the communication in the _Network_ tab, we see that the html (for smaller dimensions, which can be set in the upper left corner) is now rendered for mobile use. The sidebar with the _Genre_ section is now missing. Going back to inspecting the html, we see that the genres are now listed as a dropdown menu. The dropdown menu does not contain links; those are generated by a script. 

Back to the _Network_ tab! Cycling through all requests, we find that the html is returned as `Document`, but no other data is transferred. Let's inspect the request and navigate to its _Response_ tab. We can search it for the string `Genres`. We find three instances, but all of them prepare the script and none contains the links. While dealing with scripts was presented in today's lecture, here we should instead adjust the dimensions (upper left corner) to something larger (e.g., _Nest Hub Max_). 

A new request will now return a different html. Searching for the string `Genre` now finds the corresponding table; it's in a different element structure than in our first attempt. However, ... 

In [None]:
html.xpath('//td[text()="Genre"]')

Some whitespace characters prevent us from finding the element! (Direct inspection of `result.text` shows that it's `"Genre\r\n"`!) 

In [None]:
html.xpath('//td[contains(text(), "Genre")]') 
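The difference between the two queries can be reproduced on a made-up snippet. The sketch below also shows `normalize-space()`, a standard xpath function that trims surrounding whitespace, as one possible alternative when `contains()` would match too much (for example a cell reading "Genres").

```python
import lxml.html as lx

# Hypothetical cell whose text carries trailing whitespace, as on the real page.
source = '<html><body><table><tr><td>Genre\r\n</td></tr></table></body></html>'
doc = lx.fromstring(source)

# Exact comparison fails because of the trailing whitespace.
exact = doc.xpath('//td[text()="Genre"]')

# Substring match succeeds, but would also accept longer texts.
partial = doc.xpath('//td[contains(text(), "Genre")]')

# Trim the whitespace first, then compare exactly.
normalized = doc.xpath('//td[normalize-space(text())="Genre"]')

print(len(exact), len(partial), len(normalized))
```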

Now, how to get the correct anchors? 

In [None]:
html.xpath('//table[tr/td[contains(text(), "Genre")]]/tr//a/@href') 
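To see what the predicate in square brackets does, here is a sketch on a made-up page with two tables. Only the table that has a cell containing "Genre" passes the predicate, so the links of the other table are ignored. The markup and the link targets are invented for illustration.

```python
import lxml.html as lx

# Hypothetical page: two tables, only the second one is the genre table.
source = """
<html><body>
<table>
  <tr><td>Search</td></tr>
  <tr><td><a href="/search">Go</a></td></tr>
</table>
<table>
  <tr><td>Genre
</td></tr>
  <tr><td><a href="/genre/Action">Action</a> <a href="/genre/Comedy">Comedy</a></td></tr>
</table>
</body></html>
"""

doc = lx.fromstring(source)

# table[...] keeps only tables with a matching cell; /tr//a/@href then
# collects the href of every anchor below the rows of that table.
hrefs = doc.xpath('//table[tr/td[contains(text(), "Genre")]]/tr//a/@href')

print(hrefs)
```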

Perfect! Now, consider the [_Interstellar_](https://imsdb.com/Movie%20Scripts/Interstellar%20Script.html) page. We want to retrieve the movie release year. After inspecting the html (it might not be accurate!), we find that the date is the content of a `<td>` element, but sits among a variety of other elements. 

In [None]:
result = requests.get('https://imsdb.com/Movie%20Scripts/Interstellar%20Script.html')
result.raise_for_status()  # call the method; without parentheses nothing is checked

In [None]:
html = lx.fromstring(result.text)

In [None]:
html.xpath('//table[@class="script-details"]//td/text()') 

It's there, but how do we retrieve the correct element text? 

In [None]:
html.xpath('//b[text() = "Script Date"]/following-sibling::text()[1]')
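The `following-sibling::text()[1]` step picks the first text node that comes after the matched `<b>` inside the same parent. A sketch on a made-up cell (the structure is a guess at the page layout and the date is invented):

```python
import lxml.html as lx

# Hypothetical details cell: bold labels followed by loose text and links.
source = """
<html><body><table class="script-details"><tr><td>
<b>Writers</b> : <a href="/writer">Some Writer</a><br>
<b>Script Date</b> : November 2014<br>
<b>Genres</b> : <a href="/genre/Sci-Fi">Sci-Fi</a>
</td></tr></table></body></html>
"""

doc = lx.fromstring(source)

# Every text node after the label, up to the end of the cell.
all_following = doc.xpath('//b[text() = "Script Date"]/following-sibling::text()')

# Only the first one, which is the text right behind the label.
after_label = doc.xpath('//b[text() = "Script Date"]/following-sibling::text()[1]')

print(after_label)
```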

From here, we will use regular expressions to extract the digits of the year. We will learn about regular expressions next week. In the meantime, become an xpath [ninja](https://topswagcode.com/xpath/)!
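As a small preview, a sketch of how the year could be pulled out of such a text node with the standard library `re` module. The input string is a made-up sample, not the actual value from the page.

```python
import re

# Made-up sample of what the text node might look like.
raw = " : November 2014\r\n"

# \d{4} matches four consecutive digits; \b keeps longer numbers from matching.
match = re.search(r"\b(\d{4})\b", raw)
year = int(match.group(1)) if match else None

print(year)
```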

# Beautiful Soup

Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for navigating, searching, and modifying the parse tree.

Beautiful Soup is documented [here](https://tedboy.github.io/bs4_doc/index.html).

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
page = """
<html>
<head>
    <title>This is the Title!</title>
</head>

<body>
    <p id="best-paragraph">This is a paragraph!</p>
    <p class="important">This is another paragraph! &#127790;</p>
    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
    <span class="important">This is a span, it comes with an taco &#127790;</span>
</body>
</html>
""" 

Elements are nested, so an HTML document is like a tree:
```
html
├── head
│   └── title
└── body
    ├── p
    ├── p
    ├── p
    │   └── a
    └── span
```

## 1 Making the soup

To parse a document, pass it into the `BeautifulSoup` constructor. The `BeautifulSoup` object represents the parsed document as a whole.  Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. 

In [None]:
page_soup = BeautifulSoup(page, "html.parser") # parse the html
type(page_soup)

## 2 Navigating the tree

### Navigating using the tag types

In [None]:
page_soup.head

In [None]:
page_soup.head.title

Using a tag type for navigation will give you only the **first** tag of that type.

In [None]:
page_soup.p
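A short sketch of two consequences of this shortcut, using a made-up snippet: later tags of the same type are skipped, and a tag type that does not occur gives `None` instead of raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two paragraphs and no table.
snippet = "<html><body><p id='first'>One</p><p id='second'>Two</p></body></html>"
soup = BeautifulSoup(snippet, "html.parser")

first_p = soup.p      # only the first <p>; the second one is not reachable this way
missing = soup.table  # there is no <table>, so this is None

print(first_p["id"], missing)
```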

### Going down

A tag's children include the strings and the tags nested inside. 

### .contents

`.contents` returns the children of a tag in a list.

In [None]:
page_soup.body.contents

You can iterate over all of a tag's children with `.children`. 

In [None]:
for child in page_soup.body.children:
    print(child)
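Because strings count as children, the whitespace between tags shows up as well. One way to keep only the tags, sketched on a made-up snippet, is to filter by type with the `Tag` class from `bs4`.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical snippet; the newlines and indentation become string children.
snippet = """<body>
    <p>First</p>
    <p>Second</p>
</body>"""
soup = BeautifulSoup(snippet, "html.parser")

children = soup.body.contents  # tags and whitespace strings
tags_only = [child for child in soup.body.children if isinstance(child, Tag)]

print(len(children), [tag.name for tag in tags_only])
```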

### Going up

You can access a tag's parent with the `.parent` attribute.

In [None]:
page_soup.title.parent
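To walk further up than one level, Beautiful Soup also offers `.parents`, which iterates over all ancestors of a tag. A minimal sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

snippet = (
    "<html><body><p>Visit "
    "<a href='https://pudding.cool'>The Pudding</a>.</p></body></html>"
)
soup = BeautifulSoup(snippet, "html.parser")

link = soup.a
direct_parent = link.parent.name                   # the enclosing <p>
ancestors = [parent.name for parent in link.parents]  # from <p> up to the document

print(direct_parent, ancestors)
```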

## 3 Searching the tree

Beautiful Soup defines a lot of methods for searching the parse tree. By passing in a filter to the searching methods, you can zoom in on the parts of the document you are interested in.

### .find_all()

The `.find_all()` method looks through the parse tree or a tag’s descendants and retrieves **all** elements that match your filters.

In [None]:
# search by tag type
page_soup.find_all(name = "p") # find all <p> tags

In [None]:
# search by attribute keyword
page_soup.find_all(id = "best-paragraph") 

In [None]:
page_soup.find_all(class_ = "important") # `class_` not `class`!!!

In [None]:
# search by attribute dictionary
page_soup.find_all(attrs = {"class": "important"})
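Filters can also be combined in one call. The sketch below, on a made-up snippet that mirrors the page above, shows a tag type together with a class, a list of tag types, and the `limit` argument.

```python
from bs4 import BeautifulSoup

snippet = """
<body>
    <p id="best-paragraph">A</p>
    <p class="important">B</p>
    <p>C</p>
    <span class="important">D</span>
</body>
"""
soup = BeautifulSoup(snippet, "html.parser")

# tag type AND class: only <p> tags with class "important"
important_paragraphs = soup.find_all("p", class_="important")

# a list of tag types matches any of them
p_or_span = soup.find_all(["p", "span"])

# stop after the first two matches
first_two = soup.find_all("p", limit=2)

print(len(important_paragraphs), len(p_or_span), len(first_two))
```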

### .find()

The `.find()` method looks through the parse tree or a tag’s descendants and retrieves the **first** element that matches your filters.

In [None]:
# search by tag type
page_soup.find(name = "title") 

In [None]:
# search by attribute keyword
page_soup.find(class_ = "important") # return the first tag with specified class attribute

In [None]:
# search by attribute dictionary
page_soup.find(attrs = {"class": "important"}) # find the first tag with the specified class attribute
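When nothing matches, `.find()` returns `None` while `.find_all()` returns an empty list, so a result from `.find()` should be checked before it is used. A sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

snippet = "<html><head><title>This is the Title!</title></head><body></body></html>"
soup = BeautifulSoup(snippet, "html.parser")

title = soup.find("title")
footer = soup.find("footer")           # no <footer> in the snippet, so None
all_footers = soup.find_all("footer")  # empty list

# Guard against None before calling methods on the result.
footer_text = footer.get_text() if footer is not None else ""

print(title.get_text(), footer, all_footers)
```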

### CSS selector

`BeautifulSoup` has a `.select()` method which runs a CSS selector against a parsed document or a single tag and returns all the matching elements.

In [None]:
page_soup.select("p") # find all <p> tags

In [None]:
page_soup.select("p#best-paragraph")

In [None]:
page_soup.select("p.important")
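A few more selector forms, sketched on a made-up snippet: a descendant selector, an attribute selector, and `.select_one()`, which returns only the first match (or `None`).

```python
from bs4 import BeautifulSoup

snippet = """
<body>
    <p class="important">This is another paragraph!</p>
    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
    <span class="important">This is a span</span>
</body>
"""
soup = BeautifulSoup(snippet, "html.parser")

links_in_paragraphs = soup.select("p a")               # <a> anywhere inside a <p>
hrefs = [a["href"] for a in soup.select("a[href]")]    # <a> tags that have an href
first_important = soup.select_one(".important")        # first tag with that class

print(hrefs, first_important.name)
```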

## 4 Contents and Attributes

### .get_text()

`.get_text()` returns all the text in a document or beneath a tag.

In [None]:
page_soup.body.get_text()
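The raw text keeps the whitespace of the source. `.get_text()` accepts a separator and a `strip` flag to tidy this up; a sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

snippet = """<body>
  <p>First paragraph</p>
  <p>Second paragraph</p>
</body>"""
soup = BeautifulSoup(snippet, "html.parser")

raw = soup.body.get_text()  # includes the newlines and indentation

# strip=True trims each piece of text and drops the empty ones;
# the separator is placed between the remaining pieces.
clean = soup.body.get_text(separator=" | ", strip=True)

print(repr(raw))
print(clean)
```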

### Attributes

In [None]:
page_soup.p

We can access a tag’s attributes by treating the tag like a dictionary.

In [None]:
page_soup.p["id"]

In [None]:
page_soup.p.get("id")

We can access the tag's attribute dictionary using `.attrs`.

In [None]:
page_soup.p.attrs
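Two details worth knowing, sketched on a made-up snippet: square brackets raise a `KeyError` for a missing attribute while `.get()` returns `None`, and `class` comes back as a list because a tag may carry several classes.

```python
from bs4 import BeautifulSoup

snippet = '<p class="important note">This is another paragraph!</p>'
soup = BeautifulSoup(snippet, "html.parser")
tag = soup.p

classes = tag["class"]     # a list, one entry per class
missing = tag.get("id")    # no id attribute, so None

try:
    tag["id"]
    raised = False
except KeyError:
    raised = True

print(classes, missing, raised)
```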

## 5 Output

The `.prettify()` method will turn a Beautiful Soup parse tree or a tag into a nicely formatted Unicode string, with a separate line for each tag and each string.

In [None]:
print(page_soup.prettify()) # pretty-print the parsed document

In [None]:
print(page_soup.body.prettify()) # pretty-print the <body> tag

# Example: National Weather Service

Let's scrape the [National Weather Service](https://weather.gov/) for the weather forecast of Davis, CA.

In [None]:
url = "https://forecast.weather.gov/MapClick.php?lat=38.54669000000007&lon=-121.74456999999995#.Y9fY5vv565t"

response = requests.get(url)
response.raise_for_status()

In [None]:
html_soup = BeautifulSoup(response.text, "html.parser") # parse the html

In [None]:
seven_day = html_soup.find(id = "seven-day-forecast-container")
print(seven_day.prettify())

In [None]:
# find the time periods of the weather forecast
period_names = seven_day.find_all("p", class_ = "period-name")
period = [name.get_text() for name in period_names]
period

In [None]:
# find the weather descriptions
descs = seven_day.find_all("p", {"class": "short-desc"})
description = [desc.get_text() for desc in descs]
description

In [None]:
# find the temperatures
temps = seven_day.select("p[class *= 'temp']") # css selector
temperature = [temp.get_text() for temp in temps]
temperature
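The temperatures come back as strings. If numbers are needed, one option is a small helper like the hypothetical `parse_temperature` below. The sample strings are an assumption about the page's format, not output copied from the site.

```python
import re

# Assumed format of the scraped strings; the real page may differ.
sample = ["High: 58 °F", "Low: 41 °F", "High: 61 °F"]


def parse_temperature(text):
    """Return the first integer found in the text, or None if there is none."""
    match = re.search(r"-?\d+", text)
    return int(match.group()) if match else None


degrees = [parse_temperature(text) for text in sample]

print(degrees)
```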

In [None]:
# find the detailed weather descriptions
images = seven_day.select("div.tombstone-container img") # css selector
details = [image.attrs["title"] for image in images]
details

In [None]:
details[1].partition(":")[2] # remove the time period at the front

In [None]:
details[1].partition(":")[2].strip() # remove the leading and trailing white spaces

In [None]:
new_details = [detail.partition(":")[2].strip() for detail in details]
new_details
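`str.partition` splits at the first occurrence of the separator and always returns three parts. A sketch with made-up strings, including the case where the separator is missing and index `[2]` would be empty:

```python
# Made-up sample in the shape of an image title.
sample = "Tonight: Mostly clear, with a low around 41."

head, sep, tail = sample.partition(":")

# Without a colon the whole string lands in the first part.
no_colon = "Mostly clear".partition(":")

# Fall back to the full text when there is nothing after a colon.
detail = "Mostly clear"
cleaned = detail.partition(":")[2].strip() or detail.strip()

print(head, "|", tail.strip())
print(no_colon, cleaned)
```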

In [None]:
weather = pd.DataFrame({"Period": period,
                        "Description": description,
                        "Temperature": temperature,
                        "Detail": new_details})
weather
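Building the table this way relies on all lists having the same length. If one selector misses an entry, `pd.DataFrame` raises a `ValueError`. A sketch with made-up lists that checks the lengths first:

```python
import pandas as pd

# Made-up data: the temperature list is one entry short.
columns = {
    "Period": ["Today", "Tonight", "Friday"],
    "Temperature": ["High: 58 °F", "Low: 41 °F"],
}

lengths = {name: len(values) for name, values in columns.items()}
same_length = len(set(lengths.values())) == 1

# Constructing the DataFrame from unequal lists fails.
try:
    pd.DataFrame(columns)
    built = True
except ValueError:
    built = False

print(lengths, same_length, built)
```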