# Exploring the requests-html capabilities

The `requests` library and the `BeautifulSoup` package are simple and easy to use. 
<br> However, both have limitations when dealing with large projects or JavaScript-heavy websites.

The good news is that other Python-compatible tools are available. 
<br> The `requests_html` package is one of them, and it was created by the author of the `requests` library.

### Official documentation of the requests-html package [here](https://requests-html.readthedocs.io/en/latest/)

## Initial setup

In [1]:
# Loading the necessary packages
from requests_html import HTMLSession

In [2]:
# establish/open a session
session = HTMLSession()

In [3]:
# submitting a GET request
response = session.get("https://en.wikipedia.org/wiki/Association_football")
response.status_code

200

In [21]:
# The parsed HTML of the response is available through the '.html' attribute
print(response.html)

<HTML url='https://en.wikipedia.org/wiki/Association_football'>


## Links

In [18]:
# We can extract all link addresses directly with '.links'
urls = response.html.links

# An important thing to note: both '.links' and '.absolute_links' return a SET, not a LIST
print(type(urls))

<class 'set'>


In [19]:
list(urls)[1:10]  # convert the set to a list so that it can be sliced

['/wiki/Tejo_(sport)',
 '/wiki/Special:WhatLinksHere/Association_football',
 '/wiki/File:Football_pitch_metric.svg',
 '/wiki/1917_in_association_football',
 '/wiki/Imperial_units',
 '/wiki/Special:BookSources/1-85613-341-9',
 '/wiki/Curl_(association_football)',
 '/wiki/Red_Star_Belgrade',
 '/wiki/International_rules_football']

#### Note that these are relative URLs

In [14]:
# To get absolute URLs we can use '.absolute_links' instead of '.links'
full_path_urls = response.html.absolute_links
print(type(full_path_urls))

<class 'set'>


In [20]:
list(full_path_urls)[1:10]  # convert the set to a list so that it can be sliced

['https://en.wikibooks.org/wiki/Special:Search/Association_football',
 'https://en.wikipedia.org/wiki/Paintball',
 'https://en.wikipedia.org/wiki/1906_in_association_football',
 'https://en.wikipedia.org/wiki/Basketball',
 'https://en.wikipedia.org/wiki/Synchronised_swimming',
 'https://en.wikipedia.org/wiki/Parliament',
 'https://en.wikipedia.org/wiki/FIFA_Confederations_Cup',
 'https://en.wikipedia.org/wiki/1975_in_association_football',
 'https://en.wikipedia.org/wiki/Rink_bandy']
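
Because both attributes return sets, the order of the items shown above is not guaranteed and can differ between runs. The sketch below uses only the standard library to illustrate the idea behind absolute links: each relative link is resolved against the page URL, and the result is sorted to get a stable order. The three links are copied from the `.links` output above. This is an illustration, not necessarily how requests-html builds `.absolute_links` internally.

```python
from urllib.parse import urljoin

# URL of the page that was requested above
base_url = "https://en.wikipedia.org/wiki/Association_football"

# A few relative links copied from the '.links' output
relative_links = {
    "/wiki/Imperial_units",
    "/wiki/Red_Star_Belgrade",
    "/wiki/Tejo_(sport)",
}

# Resolve every relative link against the base URL, then sort for a stable order
absolute_links = sorted(urljoin(base_url, link) for link in relative_links)

for url in absolute_links:
    print(url)
```

Sorting the result is a simple way to make slices such as `[1:10]` reproducible.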

## Searching for elements

#### A quick note: requests-html uses CSS selectors for searching

[Here](https://www.w3schools.com/cssref/css_selectors.asp) is a more thorough look into it. 

In [22]:
# We can search for elements similarly to Beautiful Soup using the find() method
# By default it behaves like find_all(): it returns a list of all matching elements

# find all 'a' tags
links = response.html.find("a")
links[1:10]

[<Element 'a' href='/wiki/Wikipedia:Featured_articles' title='This is a featured article. Click here for more information.'>,
 <Element 'a' href='/wiki/Wikipedia:Protection_policy#semi' title='This article is semi-protected.'>,
 <Element 'a' href='/wiki/File:Football_(soccer)_Part_One.ogg' title='Listen to this article'>,
 <Element 'a' class=('mw-jump-link',) href='#mw-head'>,
 <Element 'a' class=('mw-jump-link',) href='#searchInput'>,
 <Element 'a' href='/wiki/Soccer_(disambiguation)' class=('mw-disambig',) title='Soccer (disambiguation)'>,
 <Element 'a' href='/wiki/Soccer_Team_(band)' title='Soccer Team (band)'>,
 <Element 'a' href='/wiki/Football' title='Football'>,
 <Element 'a' href='/wiki/File:Football_iu_1996.jpg' class=('image',)>]

In [11]:
links[4]

<Element 'a' class=('mw-jump-link',) href='#mw-head'>

In [12]:
# To get the raw HTML of an element use the '.html' attribute
links[4].html

'<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>'

In [13]:
type(links[4].html)

str

In [14]:
# To extract the text inside an element, use ".text", just like in Beautiful Soup
links[4].text

'Jump to navigation'

In [15]:
# To obtain a dictionary of the element's attributes, use '.attrs' (exactly as in Beautiful Soup)
links[10].attrs

{'href': '/wiki/UEFA_Champions_League', 'title': 'UEFA Champions League'}
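
Since `.attrs` is a plain dictionary, the usual dictionary tools apply. Not every tag carries every attribute (the `mw-jump-link` anchors above have no `title`, for example), so `.get()` with a default is a safe way to read optional attributes. The sketch below works on two dictionaries modelled on the outputs above rather than on live elements, and the `"(no title)"` default is an arbitrary choice for illustration.

```python
# Attribute dictionaries modelled on the elements shown above
attrs_with_title = {
    "href": "/wiki/UEFA_Champions_League",
    "title": "UEFA Champions League",
}
attrs_without_title = {"class": ("mw-jump-link",), "href": "#mw-head"}

# .get() returns the default instead of raising KeyError when 'title' is missing
titles = [
    attrs.get("title", "(no title)")
    for attrs in (attrs_with_title, attrs_without_title)
]

print(titles)
```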

In [24]:
# This package offers a couple of ways to filter tags based on their text

# Choose only those tags that contain the string 'wikipedia' in their text (not in the 'href' attribute)
# Note: this is not case-sensitive
response.html.find("a", containing = "wikipedia")[1:10]

[<Element 'a' href='/wiki/Wikipedia:About' title='Wikipedia:About'>,
 <Element 'a' href='//shop.wikimedia.org' title='Visit the Wikipedia store'>,
 <Element 'a' href='/wiki/Wikipedia:About' title='Learn about Wikipedia and how it works'>,
 <Element 'a' href='/wiki/Category:Wikipedia_articles_with_NDL_identifiers' title='Category:Wikipedia articles with NDL identifiers'>,
 <Element 'a' href='/wiki/Category:Wikipedia_articles_with_NARA_identifiers' title='Category:Wikipedia articles with NARA identifiers'>,
 <Element 'a' href='/wiki/Category:Wikipedia_articles_with_LCCN_identifiers' title='Category:Wikipedia articles with LCCN identifiers'>,
 <Element 'a' href='/wiki/Category:Wikipedia_articles_with_HDS_identifiers' title='Category:Wikipedia articles with HDS identifiers'>,
 <Element 'a' href='/wiki/Category:Wikipedia_articles_with_GND_identifiers' title='Category:Wikipedia articles with GND identifiers'>,
 <Element 'a' href='/wiki/Category:Wikipedia_articles_with_BNF_identifiers' title=

In [25]:
# display the text of those tags
[tag.text for tag in response.html.find("a", containing = "wikipedia")]

['Contact Wikipedia',
 'About Wikipedia',
 'Wikipedia store',
 'About Wikipedia',
 'Wikipedia articles with NDL identifiers',
 'Wikipedia articles with NARA identifiers',
 'Wikipedia articles with LCCN identifiers',
 'Wikipedia articles with HDS identifiers',
 'Wikipedia articles with GND identifiers',
 'Wikipedia articles with BNF identifiers',
 'Wikipedia indefinitely semi-protected pages',
 'https://en.wikipedia.org/w/index.php?title=Association_football&oldid=969143348']

In [26]:
# If we wish to find only the first element (similarly to Beautiful Soup .find()) we need to specify the 'first' parameter
response.html.find("p", first = True)

<Element 'p' class=('mw-empty-elt',)>

---
## Searching for text

In [27]:
# search() returns the first match of a template; '{}' marks the text to capture
response.html.search("known{}soccer")

<Result (' as football field, football ground, ',) {}>

In [29]:
response.html.search("known{}soccer")[0].strip()

'as football field, football ground,'

In [36]:
# search_all() returns every match; it searches the raw HTML, so results can include tags
response.html.search_all("known{}soccer")[3]

<Result (' as the <a href="/wiki/Laws_of_the_Game_(association_football)" title="Laws of the Game (association football)">Laws of the Game</a>. The game is played using a spherical ball of 68–70&#160;cm (27–28&#160;in) circumference,<sup id="cite_ref-70" class="reference"><a href="#cite_note-70">&#91;68&#93;</a></sup> known as the <i><a href="/wiki/Ball_(association_football)" title="Ball (association football)">football</a></i> (or <i>',) {}>
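
To build some intuition for the `{}` placeholder, it can be thought of as roughly similar to a non-greedy capture group in a regular expression. The sketch below is only an analogy written with the standard `re` module, not the library's actual implementation, and the sample text is hypothetical rather than taken from the scraped page.

```python
import re

# Hypothetical sample text, not taken from the scraped page
text = (
    "The field is known as a football pitch in soccer. "
    "The sport is also known simply as football rather than soccer."
)

# '{}' in the template is approximated by a non-greedy capture group
pattern = re.compile(r"known(.+?)soccer", flags=re.IGNORECASE | re.DOTALL)

# Analogous to search(): only the first match
first_match = pattern.search(text).group(1).strip()

# Analogous to search_all(): every match
all_matches = [captured.strip() for captured in pattern.findall(text)]

print(first_match)
print(all_matches)
```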

---
## CSS Selectors

[Here](https://www.w3schools.com/cssref/css_selectors.asp) is a more thorough look into it. 

### Select elements based on tag name

In [38]:
response.html.find('span')[0:10]

[<Element 'span' id='Etymology'>,
 <Element 'span' id='Names'>,
 <Element 'span' class=('toctogglespan',)>,
 <Element 'span' class=('tocnumber',)>,
 <Element 'span' class=('toctext',)>,
 <Element 'span' class=('tocnumber',)>,
 <Element 'span' class=('toctext',)>,
 <Element 'span' class=('tocnumber',)>,
 <Element 'span' class=('toctext',)>,
 <Element 'span' class=('tocnumber',)>]

### Select elements based on ID

In [43]:
response.html.find('#Name')

[<Element 'span' class=('mw-headline',) id='Name'>]

In [41]:
response.html.find('#name') # IDs are case-sensitive, so nothing is found

[]

### Selecting by class

In [56]:
response.html.find('.mw-headline')[0:5]

[<Element 'span' class=('mw-headline',) id='Name'>,
 <Element 'span' class=('mw-headline',) id='History'>,
 <Element 'span' class=('mw-headline',) id="Women's_association_football">,
 <Element 'span' class=('mw-headline',) id="Early_women's_football">,
 <Element 'span' class=('mw-headline',) id='20th_and_21st_century'>]

In [48]:
response.html.find('.metadata') 

[<Element 'div' role='navigation' aria-labelledby='sister-projects' class=('metadata', 'plainlinks', 'sistersitebox', 'plainlist', 'mbox-small') style='border:1px solid #aaa; padding:0; background:#f9f9f9;'>]

In [49]:
response.html.find('.metadata.plainlinks') # chain classes to match elements that have all of them

[<Element 'div' role='navigation' aria-labelledby='sister-projects' class=('metadata', 'plainlinks', 'sistersitebox', 'plainlist', 'mbox-small') style='border:1px solid #aaa; padding:0; background:#f9f9f9;'>]

### Selecting based on other attributes

In [51]:
response.html.find('[target]')

[<Element 'a' href='//upload.wikimedia.org/wikipedia/commons/3/30/O_Jogo_Bonito_%28The_Beautiful_Game%29.webm' title='Play media' target='new'>]

In [57]:
response.html.find('[role=note]')[1:5]

[<Element 'div' role='note' class=('hatnote', 'navigation-not-searchable')>,
 <Element 'div' role='note' class=('hatnote', 'navigation-not-searchable')>,
 <Element 'div' role='note' class=('hatnote', 'navigation-not-searchable')>,
 <Element 'div' role='note' class=('hatnote', 'navigation-not-searchable')>]

In [58]:
response.html.find('[href*=wikipedia]')[1:5]

[<Element 'link' rel=('shortcut', 'icon') href='/static/favicon/wikipedia.ico'>,
 <Element 'link' rel=('EditURI',) type='application/rsd+xml' href='//en.wikipedia.org/w/api.php?action=rsd'>,
 <Element 'link' rel=('canonical',) href='https://en.wikipedia.org/wiki/Association_football'>,
 <Element 'a' href='//upload.wikimedia.org/wikipedia/commons/3/30/O_Jogo_Bonito_%28The_Beautiful_Game%29.webm' title='Play media' target='new'>]

### Combining different filters together

In [60]:
# Search for 'a' tags whose 'href' attribute contains 'wikipedia'
response.html.find('a[href*=wikipedia]')[1:5]

[<Element 'a' class=('external', 'text') href='https://en.wikipedia.org/w/index.php?title=Template:Association_football&action=edit'>,
 <Element 'a' class=('external', 'text') href='https://en.wikipedia.org/w/index.php?title=Template:International_football&action=edit'>,
 <Element 'a' class=('external', 'text') href='https://en.wikipedia.org/w/index.php?title=Template:Association_football_laws&action=edit'>,
 <Element 'a' class=('external', 'text') href='https://en.wikipedia.org/w/index.php?title=Template:Association_football_terminology&action=edit'>]

In [65]:
# Search for 'a' tags that have the class 'internal'
response.html.find('a.internal')[1:5]

[<Element 'a' href='/wiki/File:Mia1997.JPG' class=('internal',) title='Enlarge'>,
 <Element 'a' href='/wiki/File:Women%27s_football_match_Menai_Bridge_against_Penrhos_(24622680915).jpg' class=('internal',) title='Enlarge'>,
 <Element 'a' href='/wiki/File:U20-WorldCup2007-Okotie-Onka_edit2.jpg' class=('internal',) title='Enlarge'>,
 <Element 'a' href='/wiki/File:Slidetackle.JPG' class=('internal',) title='Enlarge'>]

In [69]:
response.html.find('div.thumb')[1:5]

[<Element 'div' class=('thumb', 'tright')>,
 <Element 'div' class=('thumb', 'tright')>,
 <Element 'div' class=('thumb', 'tright')>,
 <Element 'div' class=('thumb', 'tright')>]

In [76]:
print(response.html.find('div[role=note]')[0])
response.html.find('div[role=note][class="hatnote navigation-not-searchable"]')[1:5]

<Element 'div' role='note' class=('hatnote', 'navigation-not-searchable')>


[<Element 'div' role='note' class=('hatnote', 'navigation-not-searchable')>,
 <Element 'div' role='note' class=('hatnote', 'navigation-not-searchable')>,
 <Element 'div' role='note' class=('hatnote', 'navigation-not-searchable')>,
 <Element 'div' role='note' class=('hatnote', 'navigation-not-searchable')>]

As shown above, when the class is matched through an attribute selector (`[class="..."]`) alongside another `[attribute]`, the value must be the element's full class string. Because that string contains a space, it must be enclosed in quotes (`""`).
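
The difference between the two ways of matching classes can be modelled in a few lines of plain Python. The two helper functions below are hypothetical and exist only for illustration: one mimics a chained class selector such as `.metadata.plainlinks`, the other mimics an exact attribute selector such as `[class="hatnote navigation-not-searchable"]`. The class strings are taken from the outputs above.

```python
def has_all_classes(class_attr, required):
    """Model of '.a.b': every required class must be among the element's classes."""
    return set(required) <= set(class_attr.split())


def attribute_equals(class_attr, value):
    """Model of '[class="..."]': the whole attribute string must match exactly."""
    return class_attr == value


# Class strings of two elements from the page
hatnote = "hatnote navigation-not-searchable"
sister_box = "metadata plainlinks sistersitebox plainlist mbox-small"

# A chained class selector only needs the listed classes to be present
class_match = has_all_classes(sister_box, ["metadata", "plainlinks"])

# An exact attribute selector fails on a partial or reordered class string
exact_partial = attribute_equals(sister_box, "metadata plainlinks")
exact_full = attribute_equals(hatnote, "hatnote navigation-not-searchable")
exact_reordered = attribute_equals(hatnote, "navigation-not-searchable hatnote")

print(class_match, exact_partial, exact_full, exact_reordered)
```

In practice this means a chained class selector is usually the more robust choice, since it does not depend on the order or completeness of the class list.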

### Incorporating tag hierarchy

In [78]:
response.html.find('h2 span')[1:5] # descendant selector: 'span' tags nested inside an 'h2' tag

[<Element 'span' class=('mw-headline',) id='History'>,
 <Element 'span' class=('mw-headline',) id='Gameplay'>,
 <Element 'span' class=('mw-headline',) id='Laws'>,
 <Element 'span' class=('mw-headline',) id='Governing_bodies'>]

In [79]:
# child selector: 'p' tags that are direct children of a 'div' tag
response.html.find('div > p')[1:5]

[<Element 'p' class=('mw-empty-elt',)>,
 <Element 'p' >,
 <Element 'p' >,
 <Element 'p' >]

After completing the scraping with the requests_html package, we should close the session.

In [80]:
session.close()