### Discussion Week 8

We will review an example that highlights the need to be proficient in XPath syntax, because we cannot always rely on devtools to inspect the HTML.

Consider the website [`https://imsdb.com/`](https://imsdb.com/). We want to scrape the links that are in the _Genre_ sidebar. Using devtools, we can inspect this element and find that it's a child of `table/tbody`.

In [1]:
import requests
import lxml.html as lx

In [2]:
result = requests.get('https://imsdb.com/')
result.raise_for_status()  # call the method; without the parentheses no status check is performed

In [3]:
html = lx.fromstring(result.text)

In [4]:
html.xpath('//table/tbody')

[]

The list is empty. Copying the XPath from devtools doesn't help either. Apparently, the HTML that `requests` returns is not the same as the HTML rendered by Google Chrome. We can inspect what is actually being returned in the _Network_ tab.
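
One further cause of such a mismatch is worth ruling out, independent of this particular site: browsers typically insert a `<tbody>` element into a table while building the DOM, whereas `lxml` keeps the markup as it was served. The following is a minimal sketch on a made-up table (the HTML string is an assumption, not the actual IMSDb markup):

```python
import lxml.html as lx

# Hypothetical markup: the rows sit directly under <table>, with no <tbody>.
raw = """
<html><body>
<table>
  <tr><td>Genre</td></tr>
  <tr><td><a href="/genre/Action">Action</a></td></tr>
</table>
</body></html>
"""

doc = lx.fromstring(raw)

with_tbody = doc.xpath('//table/tbody')   # path as devtools would typically show it
without_tbody = doc.xpath('//table/tr')   # path that matches the raw markup

print(with_tbody)          # lxml did not add a <tbody>
print(len(without_tbody))  # number of rows found directly under <table>
```

If a path copied from devtools contains `tbody` and returns nothing, dropping that step is a cheap first thing to try.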

While reloading the webpage to monitor the traffic in the _Network_ tab, we see that the HTML is now rendered for mobile use (at the smaller dimensions that can be set in the upper left corner). The sidebar with the _Genre_ section is now missing. Going back to inspecting the HTML, we see that the genres are now listed as a dropdown menu. The dropdown menu does not contain links; those are generated by a script.

Back to the _Network_ tab! Cycling through all requests, we find that the HTML is returned as `Document`, but no other data is transferred. Let's inspect the request and navigate to its _Response_ tab. We can search it for the string `Genres`. We find three instances, but all of them prepare the script and none contains the links. Although dealing with scripts was presented in today's lecture, here we simply adjust the dimensions (upper left corner) to something larger (e.g., _Nest Hub Max_).

A new request will now return different HTML. Searching for the string `Genre` now finds the corresponding table; it's in a different element structure than in our first attempt. However, ...

In [5]:
html.xpath('//td[text()="Genre"]')

[]

Some whitespace characters prevent us from finding the element! (Direct inspection of `result.text` shows that it's `"Genre\r\n"`!)

In [6]:
result.text

'<html>\r\n<head>\r\n<!-- Google tag (gtag.js) -->\r\n<script async src="https://www.googletagmanager.com/gtag/js?id=G-W5BXG8HCH3"></script>\r\n<script>\r\n  window.dataLayer = window.dataLayer || [];\r\n  function gtag(){dataLayer.push(arguments);}\r\n  gtag(\'js\', new Date());\r\n\r\n  gtag(\'config\', \'G-W5BXG8HCH3\');\r\n</script>\r\n\r\n<title>The Internet Movie Script Database (IMSDb)</title>\r\n<meta name="description" content="Movie scripts, Film scripts at IMSDb">\r\n<meta name="keywords" content="Scripts, Movie scripts, Film scripts.">\r\n<meta name="viewport" content="width=device-width, initial-scale=1" />\r\n<meta name="HandheldFriendly" content="true">\r\n<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">\r\n<meta http-equiv="Content-Language" content="EN">\r\n\r\n<meta name=objecttype CONTENT=Document>\r\n<meta name=ROBOTS CONTENT="INDEX, FOLLOW">\r\n<meta name=Subject CONTENT="Movie scripts, Film scripts">\r\n<meta name=rating CONTENT=General>\r\

In [7]:
html.xpath('//td[contains(text(), "Genre")]') 

[<Element td at 0x7f8415a15860>]
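
`contains()` would also match cells such as "Genres" or "Subgenre". A stricter alternative is the XPath function `normalize-space()`, which trims leading and trailing whitespace before the comparison. A small sketch on a made-up cell (the HTML string is an assumption that only mimics the `"Genre\r\n"` text seen above):

```python
import lxml.html as lx

# Hypothetical cell whose text carries a trailing line break.
raw = '<html><body><table><tr><td>Genre\r\n</td></tr></table></body></html>'
doc = lx.fromstring(raw)

# Exact comparison fails because of the trailing whitespace.
exact = doc.xpath('//td[text()="Genre"]')

# normalize-space() strips the whitespace, so the exact comparison works.
normalized = doc.xpath('//td[normalize-space(text())="Genre"]')

print(len(exact), len(normalized))
```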

Now, how do we get the correct anchors?

In [8]:
html.xpath('//table[tr/td[contains(text(), "Genre")]]/tr//a/@href') 

['/genre/Action',
 '/genre/Adventure',
 '/genre/Animation',
 '/genre/Comedy',
 '/genre/Crime',
 '/genre/Drama',
 '/genre/Family',
 '/genre/Fantasy',
 '/genre/Film-Noir',
 '/genre/Horror',
 '/genre/Musical',
 '/genre/Mystery',
 '/genre/Romance',
 '/genre/Sci-Fi',
 '/genre/Short',
 '/genre/Thriller',
 '/genre/War',
 '/genre/Western']
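
The extracted `href` values are relative to the site root. To request the genre pages, they would first have to be turned into absolute URLs. One way to do this, sketched here with the standard library's `urljoin` (the names `base_url` and `genre_urls` are chosen for illustration):

```python
from urllib.parse import urljoin

base_url = "https://imsdb.com/"

# A few of the relative links from the output above.
hrefs = ["/genre/Action", "/genre/Film-Noir", "/genre/Sci-Fi"]

# urljoin resolves each relative path against the base URL.
genre_urls = [urljoin(base_url, href) for href in hrefs]
print(genre_urls)
```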

Perfect! Now, consider the [_Interstellar_](https://imsdb.com/Movie%20Scripts/Interstellar%20Script.html) page. We want to retrieve the script date. After inspecting the HTML in devtools (which might not be accurate!), we find that the date is the content of a `<td>` element, but it is buried among a variety of other elements.

In [9]:
result = requests.get('https://imsdb.com/Movie%20Scripts/Interstellar%20Script.html')
result.raise_for_status()  # call the method; without the parentheses no status check is performed

In [10]:
html = lx.fromstring(result.text)

In [11]:
html.xpath('//table[@class="script-details"]//td/text()') 

['\r\n      ',
 '\xa0\xa0Excellent.',
 '\r\n      ',
 '\xa0\xa0',
 ' (10 out of 10)',
 '\r\n\t  ',
 '\xa0\xa0',
 ' (9.50 out of 10)',
 '\r\n\r\n',
 '\r\n',
 '\r\n',
 '\r\n',
 '\r\n\t  \r\n      ',
 '\r\n      ',
 '\xa0\xa0',
 '\r\n      ',
 '\xa0\xa0',
 '\xa0\xa0',
 '\xa0\xa0',
 '\r\n\t ',
 ' : March 2008',
 '\r\n\t ',
 ' : November 2014',
 '\r\n\t \r\n',
 '\r\n',
 '\r\n']

It's there, but how do we retrieve the correct element text?

In [12]:
html.xpath('//b[text() = "Script Date"]/following-sibling::text()[1]')

[' : March 2008']
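
The query works because the date is a bare text node that sits right after the `<b>` label, and `following-sibling::text()[1]` picks only the first such text node. The sketch below replays this on a simplified, made-up stand-in for the details cell (the markup and the helper `field_text` are assumptions for illustration, not the real page structure):

```python
import lxml.html as lx

# Hypothetical, simplified stand-in for the details cell.
raw = """
<html><body>
<table class="script-details"><tr><td>
  <b>Script Date</b> : March 2008<br>
  <b>Movie Release Date</b> : November 2014<br>
</td></tr></table>
</body></html>
"""
doc = lx.fromstring(raw)

def field_text(label):
    """Return the text node directly following the <b> with this label, or None."""
    # $label is an XPath variable; lxml fills it from the keyword argument.
    nodes = doc.xpath('//b[text()=$label]/following-sibling::text()[1]', label=label)
    return nodes[0].strip() if nodes else None

print(field_text("Script Date"))
print(field_text("Movie Release Date"))
print(field_text("Budget"))  # label that does not exist in the snippet
```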

From here, we will use regular expressions to extract the digits of the year. We will learn about regular expressions next week. In the meantime, become an xpath [ninja](https://topswagcode.com/xpath/)!
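
As a small preview of that step, here is a minimal sketch that assumes the year is the first run of four digits in the string returned by the XPath query above:

```python
import re

script_date = " : March 2008"  # value returned by the XPath query above

# \d{4} matches four consecutive digits; search() returns None if there are none.
match = re.search(r"\d{4}", script_date)
year = int(match.group()) if match else None
print(year)
```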

# Beautiful Soup

Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for navigating, searching, and modifying the parse tree.

Beautiful Soup is documented [here](https://tedboy.github.io/bs4_doc/index.html).

In [13]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [14]:
page = """
<html>
<head>
    <title>This is the Title!</title>
</head>

<body>
    <p id="best-paragraph">This is a paragraph!</p>
    <p class="important">This is another paragraph! &#127790;</p>
    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
    <span class="important">This is a span, it comes with an taco &#127790;</span>
</body>
</html>
""" 

Elements are nested, so an HTML document is like a tree:
```
html
├── head
│   └── title
└── body
    ├── p
    ├── p
    ├── p
    │   └── a
    └── span
```

## 1 Making the soup

To parse a document, pass it into the `BeautifulSoup` constructor. The `BeautifulSoup` object represents the parsed document as a whole.  Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. 

In [15]:
page_soup = BeautifulSoup(page, "html.parser") # parse the html
type(page_soup)

bs4.BeautifulSoup

## 2 Navigating the tree

### Navigating using the tag types

In [16]:
page_soup.head

<head>
<title>This is the Title!</title>
</head>

In [17]:
page_soup.head.title

<title>This is the Title!</title>

Using a tag type for navigation will give you only the **first** tag of that type.

In [18]:
page_soup.p

<p id="best-paragraph">This is a paragraph!</p>

### Going down

A tag's children include the strings and the tags nested inside. 

### .contents

`.contents` returns the children of a tag in a list.

In [19]:
page_soup.body.contents

['\n',
 <p id="best-paragraph">This is a paragraph!</p>,
 '\n',
 <p class="important">This is another paragraph! 🌮</p>,
 '\n',
 <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>,
 '\n',
 <span class="important">This is a span, it comes with an taco 🌮</span>,
 '\n']

You can iterate over all of a tag's children with `.children`. 

In [20]:
for child in page_soup.body.children:
    print(child)



<p id="best-paragraph">This is a paragraph!</p>


<p class="important">This is another paragraph! 🌮</p>


<p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>


<span class="important">This is a span, it comes with an taco 🌮</span>
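
The blank lines in this output come from the `'\n'` strings between the tags, which count as children too. If only the tags are of interest, one option is to filter by type, or to search the direct children only. A sketch on a shortened copy of the example page (the snippet is an assumption made for brevity):

```python
from bs4 import BeautifulSoup, Tag

snippet = """
<body>
    <p id="best-paragraph">This is a paragraph!</p>
    <span class="important">This is a span</span>
</body>
"""
soup = BeautifulSoup(snippet, "html.parser")

# Keep only children that are tags, skipping the whitespace strings.
tag_children = [child for child in soup.body.children if isinstance(child, Tag)]
tag_names = [child.name for child in tag_children]

# Same result: match every tag, but do not descend below the direct children.
direct = soup.body.find_all(True, recursive=False)

print(tag_names)
```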




### Going up

You can access a tag's parent with the `.parent` attribute.

In [21]:
page_soup.title.parent

<head>
<title>This is the Title!</title>
</head>

## 3 Searching the tree

Beautiful Soup defines a lot of methods for searching the parse tree. By passing in a filter to the searching methods, you can zoom in on the parts of the document you are interested in.

### .find_all()

The `.find_all()` method looks through the parse tree or a tag’s descendants and retrieves **all** elements that match your filters.

In [22]:
# search by tag type
page_soup.find_all(name = "p") # find all <p> tags

[<p id="best-paragraph">This is a paragraph!</p>,
 <p class="important">This is another paragraph! 🌮</p>,
 <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>]

In [23]:
# search by attribute keyword
page_soup.find_all(id = "best-paragraph") 

[<p id="best-paragraph">This is a paragraph!</p>]

In [24]:
page_soup.find_all(class_ = "important") # use `class_`, because `class` is a reserved word in Python

[<p class="important">This is another paragraph! 🌮</p>,
 <span class="important">This is a span, it comes with an taco 🌮</span>]

In [25]:
# search by attribute dictionary
page_soup.find_all(attrs = {"class": "important"})

[<p class="important">This is another paragraph! 🌮</p>,
 <span class="important">This is a span, it comes with an taco 🌮</span>]

### .find()

The `.find()` method looks through the parse tree or a tag’s descendants and retrieves the **first** element that matches your filters.

In [26]:
# search by tag type
page_soup.find(name = "title") 

<title>This is the Title!</title>

In [27]:
# search by attribute keyword
page_soup.find(class_ = "important") # return the first tag with specified class attribute

<p class="important">This is another paragraph! 🌮</p>

In [28]:
# search by attribute dictionary
page_soup.find(attrs = {"class": "important"}) # find the first tag with the specified class attribute

<p class="important">This is another paragraph! 🌮</p>
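
When nothing matches, `.find()` returns `None` (while `.find_all()` returns an empty list), so calling a method on the result directly can fail. A minimal sketch of a guard; the helper `text_or_default` is a hypothetical name introduced here:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="important">Hello</p>', "html.parser")

missing = soup.find("table")          # there is no <table> in this snippet
all_missing = soup.find_all("table")  # empty result list

def text_or_default(tag, default=""):
    """Return the tag's text, or a default when find() found nothing."""
    return tag.get_text() if tag is not None else default

found_text = text_or_default(soup.find(class_="important"))
missing_text = text_or_default(soup.find("table"), default="n/a")
print(found_text, missing_text)
```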

### CSS selector

`BeautifulSoup` has a `.select()` method which runs a CSS selector against a parsed document or a single tag and returns all the matching elements.

In [29]:
page_soup.select("p") # find all <p> tags

[<p id="best-paragraph">This is a paragraph!</p>,
 <p class="important">This is another paragraph! 🌮</p>,
 <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>]

In [30]:
page_soup.select("p#best-paragraph")

[<p id="best-paragraph">This is a paragraph!</p>]

In [31]:
page_soup.select("p.important")

[<p class="important">This is another paragraph! 🌮</p>]
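
CSS selectors can also match on a class alone, on the presence of an attribute, or on a parent/child relationship, and `.select_one()` returns just the first match. A short sketch on a reduced copy of the example page (the snippet is shortened for illustration):

```python
from bs4 import BeautifulSoup

snippet = """
<body>
    <p class="important">This is another paragraph!</p>
    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
    <span class="important">This is a span</span>
</body>
"""
soup = BeautifulSoup(snippet, "html.parser")

by_class = soup.select(".important")            # any tag with this class
links = soup.select("p > a[href]")              # <a> with an href, direct child of a <p>
first_important = soup.select_one(".important") # first match only, or None

hrefs = [a["href"] for a in links]
print([tag.name for tag in by_class], hrefs)
```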

## 4 Contents and Attributes

### .get_text()

`.get_text()` returns all the text in a document or beneath a tag.

In [32]:
page_soup.body.get_text()

'\nThis is a paragraph!\nThis is another paragraph! 🌮\nVisit The Pudding.\nThis is a span, it comes with an taco 🌮\n'

### Attributes

In [33]:
page_soup.p

<p id="best-paragraph">This is a paragraph!</p>

We can access a tag’s attributes by treating the tag like a dictionary.

In [34]:
page_soup.p["id"]

'best-paragraph'

In [35]:
page_soup.p.get("id")

'best-paragraph'

We can access the tag's attribute dictionary using `.attrs`.

In [36]:
page_soup.p.attrs

{'id': 'best-paragraph'}
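
The two access styles differ when an attribute is missing: indexing raises a `KeyError`, while `.get()` returns `None` or a default. Note also that `class` is multi-valued, so Beautiful Soup returns it as a list. A minimal sketch (the tag below is a made-up variant of the example paragraph):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="best-paragraph" class="important big">Hi</p>', "html.parser")
p = soup.p

missing = p.get("href")        # attribute absent: None
fallback = p.get("href", "#")  # attribute absent: the given default
classes = p["class"]           # multi-valued attribute: a list of class names

try:
    p["href"]                  # indexing a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True

print(missing, fallback, classes, raised)
```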

## 5 Output

The `.prettify()` method will turn a Beautiful Soup parse tree or a tag into a nicely formatted Unicode string, with a separate line for each tag and each string.

In [37]:
print(page_soup.prettify()) # pretty-print the parsed document

<html>
 <head>
  <title>
   This is the Title!
  </title>
 </head>
 <body>
  <p id="best-paragraph">
   This is a paragraph!
  </p>
  <p class="important">
   This is another paragraph! 🌮
  </p>
  <p>
   Visit
   <a href="https://pudding.cool">
    The Pudding
   </a>
   .
  </p>
  <span class="important">
   This is a span, it comes with an taco 🌮
  </span>
 </body>
</html>



In [38]:
print(page_soup.body.prettify()) # pretty-print the <body> tag

<body>
 <p id="best-paragraph">
  This is a paragraph!
 </p>
 <p class="important">
  This is another paragraph! 🌮
 </p>
 <p>
  Visit
  <a href="https://pudding.cool">
   The Pudding
  </a>
  .
 </p>
 <span class="important">
  This is a span, it comes with an taco 🌮
 </span>
</body>



# Example: National Weather Service

Let's scrape the [National Weather Service](https://weather.gov/) for the weather forecast of Davis, CA.

In [39]:
url = "https://forecast.weather.gov/MapClick.php?lat=38.54669000000007&lon=-121.74456999999995#.Y9fY5vv565t"

response = requests.get(url)
response.raise_for_status()

In [40]:
html_soup = BeautifulSoup(response.text, "html.parser") # parse the html

In [41]:
seven_day = html_soup.find(id = "seven-day-forecast-container")
print(seven_day.prettify())

<div id="seven-day-forecast-container">
 <ul class="list-unstyled" id="seven-day-forecast-list">
  <li class="forecast-tombstone">
   <div class="tombstone-container">
    <p class="period-name">
     Today
    </p>
    <p>
     <img alt="Today: Mostly sunny, with a high near 65. South wind 6 to 9 mph. " class="forecast-icon" src="newimages/medium/sct.png" title="Today: Mostly sunny, with a high near 65. South wind 6 to 9 mph. "/>
    </p>
    <p class="temp temp-high">
     High: 65 °F
    </p>
    <p class="short-desc">
     Mostly Sunny
    </p>
   </div>
  </li>
  <li class="forecast-tombstone">
   <div class="tombstone-container">
    <p class="period-name">
     Tonight
    </p>
    <p>
     <img alt="Tonight: Showers, mainly between 11pm and 3am.  Low around 48. South wind around 8 mph, with gusts as high as 18 mph.  Chance of precipitation is 90%. New precipitation amounts of less than a tenth of an inch possible. " class="forecast-icon" src="newimages/medium/nshra90.png" title

In [42]:
# find the time periods of the weather forecast
period_names = seven_day.find_all("p", class_ = "period-name")
period = [name.get_text() for name in period_names]
period

['Today',
 'Tonight',
 'Thursday',
 'Thursday Night',
 'Friday',
 'Friday Night',
 'Saturday',
 'Saturday Night',
 'Sunday']

In [43]:
# find the weather descriptions
descs = seven_day.find_all("p", {"class": "short-desc"})
description = [desc.get_text() for desc in descs]
description

['Mostly Sunny',
 'Showers',
 'ChanceShowers',
 'Slight ChanceShowers',
 'ChanceShowers',
 'Showers',
 'ShowersLikely',
 'Slight ChanceShowers thenMostly Clear',
 'Sunny']
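
Values such as `'ChanceShowers'` suggest that the words are separated by line-break tags rather than by whitespace; this is an assumption about the page's markup. If so, the `separator` argument of `.get_text()` can restore the spaces. A sketch on a made-up paragraph:

```python
from bs4 import BeautifulSoup

# Hypothetical markup in which a <br/> separates the two parts of the description.
soup = BeautifulSoup('<p class="short-desc">Slight Chance<br/>Showers</p>', "html.parser")
desc = soup.find("p", class_="short-desc")

joined = desc.get_text()                           # pieces are concatenated directly
spaced = desc.get_text(separator=" ", strip=True)  # pieces are joined with a space

print(repr(joined), repr(spaced))
```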

In [44]:
# find the temperatures
temps = seven_day.select("p[class *= 'temp']") # css selector
temperature = [temp.get_text() for temp in temps]
temperature

['High: 65 °F',
 'Low: 48 °F',
 'High: 65 °F',
 'Low: 44 °F',
 'High: 65 °F',
 'Low: 49 °F',
 'High: 62 °F',
 'Low: 46 °F',
 'High: 67 °F']
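
The temperatures are still strings. For numerical work they could be split into a label and an integer, for instance with a regular expression. The helper `parse_temp` below is a hypothetical sketch that assumes every entry looks like `'High: 65 °F'` or `'Low: 48 °F'`:

```python
import re

temperature = ["High: 65 °F", "Low: 48 °F"]  # first two values from the output above

def parse_temp(text):
    """Split 'High: 65 °F' into ('High', 65); return None if the pattern is absent."""
    match = re.search(r"(High|Low):\s*(-?\d+)", text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))

parsed = [parse_temp(t) for t in temperature]
print(parsed)
```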

In [45]:
# find the detailed weather descriptions
images = seven_day.select("div.tombstone-container img") # css selector
details = [image.attrs["title"] for image in images]
details

['Today: Mostly sunny, with a high near 65. South wind 6 to 9 mph. ',
 'Tonight: Showers, mainly between 11pm and 3am.  Low around 48. South wind around 8 mph, with gusts as high as 18 mph.  Chance of precipitation is 90%. New precipitation amounts of less than a tenth of an inch possible. ',
 'Thursday: A 40 percent chance of showers.  Partly sunny, with a high near 65. Southwest wind around 7 mph. ',
 'Thursday Night: A 20 percent chance of showers.  Partly cloudy, with a low around 44. Southwest wind 3 to 6 mph. ',
 'Friday: A 30 percent chance of showers, mainly after 11am.  Partly sunny, with a high near 65. South southwest wind 3 to 6 mph. ',
 'Friday Night: Showers.  Low around 49. Chance of precipitation is 90%.',
 'Saturday: Showers likely, mainly before 11am.  Partly sunny, with a high near 62. Chance of precipitation is 70%.',
 'Saturday Night: A slight chance of showers before 11pm.  Mostly clear, with a low around 46.',
 'Sunday: Sunny, with a high near 67.']

In [46]:
details[1].partition(":")[2] # remove the time period at the front

' Showers, mainly between 11pm and 3am.  Low around 48. South wind around 8 mph, with gusts as high as 18 mph.  Chance of precipitation is 90%. New precipitation amounts of less than a tenth of an inch possible. '

In [47]:
details[1].partition(":")[2].strip() # remove the leading and trailing white spaces

'Showers, mainly between 11pm and 3am.  Low around 48. South wind around 8 mph, with gusts as high as 18 mph.  Chance of precipitation is 90%. New precipitation amounts of less than a tenth of an inch possible.'

In [48]:
new_details = [detail.partition(":")[2].strip() for detail in details]
new_details

['Mostly sunny, with a high near 65. South wind 6 to 9 mph.',
 'Showers, mainly between 11pm and 3am.  Low around 48. South wind around 8 mph, with gusts as high as 18 mph.  Chance of precipitation is 90%. New precipitation amounts of less than a tenth of an inch possible.',
 'A 40 percent chance of showers.  Partly sunny, with a high near 65. Southwest wind around 7 mph.',
 'A 20 percent chance of showers.  Partly cloudy, with a low around 44. Southwest wind 3 to 6 mph.',
 'A 30 percent chance of showers, mainly after 11am.  Partly sunny, with a high near 65. South southwest wind 3 to 6 mph.',
 'Showers.  Low around 49. Chance of precipitation is 90%.',
 'Showers likely, mainly before 11am.  Partly sunny, with a high near 62. Chance of precipitation is 70%.',
 'A slight chance of showers before 11pm.  Mostly clear, with a low around 46.',
 'Sunny, with a high near 67.']
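
One caveat of `partition(":")[2]`: if a string contains no colon, the third element is empty and the whole text is lost. A slightly more defensive variant, with the hypothetical helper name `strip_period`:

```python
def strip_period(detail):
    """Drop a leading 'Period:' label; keep the text unchanged if there is no colon."""
    head, sep, tail = detail.partition(":")  # splits at the first colon only
    return tail.strip() if sep else detail.strip()

print(strip_period("Sunday: Sunny, with a high near 67."))
print(strip_period("Sunny, with a high near 67."))
```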

In [49]:
weather = pd.DataFrame({"Period": period,
                        "Description": description,
                        "Temperature": temperature,
                        "Detail": new_details})
weather

| | Period | Description | Temperature | Detail |
|---|---|---|---|---|
| 0 | Today | Mostly Sunny | High: 65 °F | Mostly sunny, with a high near 65. South wind ... |
| 1 | Tonight | Showers | Low: 48 °F | Showers, mainly between 11pm and 3am. Low aro... |
| 2 | Thursday | ChanceShowers | High: 65 °F | A 40 percent chance of showers. Partly sunny,... |
| 3 | Thursday Night | Slight ChanceShowers | Low: 44 °F | A 20 percent chance of showers. Partly cloudy... |
| 4 | Friday | ChanceShowers | High: 65 °F | A 30 percent chance of showers, mainly after 1... |
| 5 | Friday Night | Showers | Low: 49 °F | Showers. Low around 49. Chance of precipitati... |
| 6 | Saturday | ShowersLikely | High: 62 °F | Showers likely, mainly before 11am. Partly su... |
| 7 | Saturday Night | Slight ChanceShowers thenMostly Clear | Low: 46 °F | A slight chance of showers before 11pm. Mostl... |
| 8 | Sunday | Sunny | High: 67 °F | Sunny, with a high near 67. |
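
Building the columns from four separate lists works only as long as every forecast item contains every field; if one item lacked, say, a temperature, the lists would differ in length or shift against each other. An alternative is to extract all fields item by item. The sketch below uses a shortened, made-up version of the forecast markup, and the helper `text_of` is a hypothetical name:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened forecast markup; the second item lacks a temperature.
snippet = """
<ul id="seven-day-forecast-list">
  <li class="forecast-tombstone"><div class="tombstone-container">
    <p class="period-name">Today</p>
    <p class="temp temp-high">High: 65 °F</p>
    <p class="short-desc">Mostly Sunny</p>
  </div></li>
  <li class="forecast-tombstone"><div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="short-desc">Showers</p>
  </div></li>
</ul>
"""
soup = BeautifulSoup(snippet, "html.parser")

def text_of(parent, selector):
    """Text of the first match below parent, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None

# One dictionary per forecast item keeps the fields of each row together.
rows = [
    {
        "Period": text_of(item, "p.period-name"),
        "Description": text_of(item, "p.short-desc"),
        "Temperature": text_of(item, "p.temp"),
    }
    for item in soup.select("div.tombstone-container")
]
print(rows)
```

A list of dictionaries like `rows` can be passed to `pd.DataFrame(rows)`, with missing fields showing up as empty values instead of misaligning the table.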
