### Discussion 6

We will review an example that shows why it pays to be proficient in xpath syntax: sometimes we cannot rely on inspecting the html with devtools.

Consider the website [`https://imsdb.com/`](https://imsdb.com/). We want to scrape the links in the _Genre_ sidebar. Using devtools, we can inspect this element and find that it's a child of `table/tbody`.

A useful cheatsheet for xpath: https://devhints.io/xpath

In [1]:
import requests
import lxml.html as lx

In [10]:
result = requests.get('https://imsdb.com/')
result.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx response
result.status_code

200
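Note that `raise_for_status` only performs its check when it is called with parentheses; the bare attribute evaluates to a bound method and never raises. The following is a minimal offline sketch of the difference. It builds a `Response` object by hand and sets its status code to 404, which is an assumption made purely for illustration and not something the site returned.

```python
import requests

# Build a bare Response by hand so the example runs without network access.
# Setting status_code directly is for illustration only.
fake = requests.Response()
fake.status_code = 404

# Referencing the attribute just yields a bound method; nothing is checked.
not_called = fake.raise_for_status

# Calling the method performs the check and raises for 4xx/5xx status codes.
try:
    fake.raise_for_status()
    raised = False
except requests.HTTPError:
    raised = True

print(callable(not_called), raised)
```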

In [3]:
html = lx.fromstring(result.text)

In [4]:
html.xpath('//table/tbody')

[]

The list is empty. Copying the xpath from devtools doesn't help either. Apparently, the html that `requests` returns is not the same as the one rendered by Google Chrome. We can inspect what is actually being returned by checking the _Network_ tab.
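One likely contributor to this particular mismatch, offered as an explanation rather than something verified against the site: browsers insert a `<tbody>` element into every table while building the DOM, even when the raw html has none, whereas `lxml` keeps the markup as it was sent. The sketch below uses a small made-up table (not the real page) to show the effect.

```python
import lxml.html as lx

# Hypothetical raw html, written the way many servers send it: no <tbody>.
raw = """
<html><body>
<table>
  <tr><td>Genre</td></tr>
  <tr><td><a href="/genre/Action">Action</a></td></tr>
</table>
</body></html>
"""
doc = lx.fromstring(raw)

# The path copied from devtools expects a <tbody> that lxml never sees.
with_tbody = doc.xpath('//table/tbody')

# Using the descendant axis avoids depending on the intermediate element.
rows = doc.xpath('//table//tr')

print(with_tbody, len(rows))
```

Writing `//table//tr` instead of `//table/tbody/tr` keeps the expression valid for both the raw html and the browser's DOM.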

While reloading the webpage to monitor the communication in the _Network_ tab, we see that the html (for smaller dimensions, which can be set in the upper left corner) is now rendered for mobile use. The sidebar with the _Genre_ section is now missing. Going back to inspecting the html, we see that the genres are now listed as a dropdown menu. The dropdown menu does not contain links; those are generated by a script.

Back to the _Network_ tab! Cycling through all requests, we find that the html is returned as `Document`, but no other data is transferred. Let's inspect the request and navigate to its _Response_ tab. We can search it for the string `Genres`. We find three instances, but all of them prepare the script and none contains the links. Dealing with scripts was presented in today's lecture, but here there is a simpler route: we adjust the dimensions (upper left corner) to something larger (e.g., _Nest Hub Max_).
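The same kind of search can be run on the text that `requests` received, which is the html our xpath expressions actually operate on. The helper below is a hypothetical convenience function (`find_context` is not part of any library), and the sample string is made up; in the notebook one would pass `result.text` instead.

```python
def find_context(text, needle, width=40):
    """Return every occurrence of needle with `width` characters of context."""
    hits = []
    start = text.find(needle)
    while start != -1:
        lo = max(0, start - width)
        hi = min(len(text), start + len(needle) + width)
        hits.append(text[lo:hi])
        start = text.find(needle, start + 1)
    return hits


# Made-up sample standing in for result.text.
sample = '<td>Genre\r\n</td><select><option>Genres</option></select>'
hits = find_context(sample, 'Genre', width=4)

# repr() makes invisible characters such as \r\n visible.
for hit in hits:
    print(repr(hit))
```

Printing with `repr` is the useful part: it exposes stray whitespace that would otherwise be invisible.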

A new request will now return a different html. Searching for the string `Genre` now finds the corresponding table; it's in a different element structure than in our first attempt. However, ...

In [9]:
html.xpath('//td[text()="Genre"]')

[]

Some whitespace characters prevent us from finding the element! (Direct inspection of `result.text` shows that it's `"Genre\r\n"`!)

In [6]:
html.xpath('//td[contains(text(), "Genre")]') 

[<Element td at 0x7fe702289800>]
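An alternative to `contains` is the xpath function `normalize-space`, which trims leading and trailing whitespace before the comparison. The sketch below uses a made-up fragment with two cells to illustrate the trade-off: `contains` also matches longer strings such as `Genres of music`, while the normalized comparison stays exact.

```python
import lxml.html as lx

# Made-up fragment: one cell with trailing \r\n, one with a longer label.
doc = lx.fromstring(
    '<html><body><table><tr>'
    '<td>Genre\r\n</td><td>Genres of music</td>'
    '</tr></table></body></html>'
)

# Exact comparison fails because of the trailing whitespace.
exact = doc.xpath('//td[text()="Genre"]')

# Substring match finds both cells.
partial = doc.xpath('//td[contains(text(), "Genre")]')

# Whitespace-insensitive exact match finds only the first cell.
normalized = doc.xpath('//td[normalize-space(text())="Genre"]')

print(len(exact), len(partial), len(normalized))
```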

Now, how do we get the correct anchors?

In [7]:
html.xpath('//table[tr/td[contains(text(), "Genre")]]/tr//a/@href') 

['/genre/Action',
 '/genre/Adventure',
 '/genre/Animation',
 '/genre/Comedy',
 '/genre/Crime',
 '/genre/Drama',
 '/genre/Family',
 '/genre/Fantasy',
 '/genre/Film-Noir',
 '/genre/Horror',
 '/genre/Musical',
 '/genre/Mystery',
 '/genre/Romance',
 '/genre/Sci-Fi',
 '/genre/Short',
 '/genre/Thriller',
 '/genre/War',
 '/genre/Western']
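The predicate `table[tr/td[...]]` keeps only tables that have a matching cell, so links from other tables are ignored. The sketch below illustrates this on a made-up page with two tables, and then turns the relative links into absolute urls with `urljoin` from the standard library. The page content and the `/search` link are assumptions for illustration only.

```python
from urllib.parse import urljoin

import lxml.html as lx

# Made-up page: only the second table has a cell containing "Genre".
page = """
<html><body>
<table><tr><td>Search</td></tr><tr><td><a href="/search">Go</a></td></tr></table>
<table>
  <tr><td>Genre\r\n</td></tr>
  <tr><td><a href="/genre/Action">Action</a> <a href="/genre/Drama">Drama</a></td></tr>
</table>
</body></html>
"""
doc = lx.fromstring(page)

# Select the table by its content, then collect the href of every anchor in it.
hrefs = doc.xpath('//table[tr/td[contains(text(), "Genre")]]/tr//a/@href')

# Relative links need the site's base url before they can be requested.
base = 'https://imsdb.com/'
urls = [urljoin(base, h) for h in hrefs]

print(urls)
```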

Perfect! Now, consider the [_Interstellar_](https://imsdb.com/Movie%20Scripts/Interstellar%20Script.html) page. We want to retrieve the year from one of the dates listed on the page, here the script date. After inspecting the html (it might not be accurate!), we find that the date is part of the content of a `<td>` element, but is buried among a variety of other elements.

In [8]:
result = requests.get('https://imsdb.com/Movie%20Scripts/Interstellar%20Script.html')
result.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx response
result.status_code

200

In [9]:
html = lx.fromstring(result.text)

In [10]:
html.xpath('//table[@class="script-details"]//td/text()') 

['\r\n      ',
 '\xa0\xa0Excellent.',
 '\r\n      ',
 '\xa0\xa0',
 ' (10 out of 10)',
 '\r\n\t  ',
 '\xa0\xa0',
 ' (9.50 out of 10)',
 '\r\n\r\n',
 '\r\n',
 '\r\n',
 '\r\n',
 '\r\n\t  \r\n      ',
 '\r\n      ',
 '\xa0\xa0',
 '\r\n      ',
 '\xa0\xa0',
 '\xa0\xa0',
 '\xa0\xa0',
 '\r\n\t ',
 ' : March 2008',
 '\r\n\t ',
 ' : November 2014',
 '\r\n\t \r\n',
 '\r\n',
 '\r\n']
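Most of these entries are whitespace only. To make such a list easier to scan, one option is to strip each string and drop the empty ones. The sketch below does this on a handful of entries copied from the output above; note that `str.strip()` without arguments also removes the non-breaking space `\xa0`.

```python
# A few entries copied from the xpath result above.
texts = [
    '\r\n      ',
    '\xa0\xa0Excellent.',
    '\xa0\xa0',
    ' (10 out of 10)',
    '\r\n\t ',
    ' : March 2008',
    ' : November 2014',
    '\r\n',
]

# Strip surrounding whitespace and keep only the non-empty strings.
cleaned = [t.strip() for t in texts if t.strip()]

print(cleaned)
```

This helps with reading the output, but it does not tell us which date belongs to which label.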

It's there, but how do we retrieve the correct element text?

In [11]:
html.xpath('//b[text() = "Script Date"]/following-sibling::text()[1]')

[' : March 2008']
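The `[1]` at the end matters: `following-sibling::text()` returns every text node after the `<b>` element, and the index keeps only the nearest one. The sketch below shows this on a made-up fragment; its structure and the label `Movie Release Date` are assumptions for illustration, not copied from the real page.

```python
import lxml.html as lx

# Made-up fragment: two labels, each followed by a loose text node.
doc = lx.fromstring(
    '<html><body><table class="script-details"><tr><td>'
    '<b>Script Date</b> : March 2008<br>'
    '<b>Movie Release Date</b> : November 2014<br>'
    '</td></tr></table></body></html>'
)

# Without an index, all later sibling text nodes are returned.
all_following = doc.xpath('//b[text()="Script Date"]/following-sibling::text()')

# With [1], only the text node directly after the label remains.
first_only = doc.xpath('//b[text()="Script Date"]/following-sibling::text()[1]')

print(all_following)
print(first_only)
```

Anchoring on the label text rather than on position keeps the expression stable if other rows are added or reordered.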

From here, we will use regular expressions to extract the digits of the year. We will learn about regular expressions next week. In the meantime, become an xpath [ninja](https://topswagcode.com/xpath/)!