# Scraping data generated by JavaScript

In [1]:
# When coding in Jupyter and Spyder, we need to use the class AsyncHTMLSession to make JavaScript work
# In other environments you can use the normal HTMLSession
from requests_html import AsyncHTMLSession

In [2]:
# establish a new asynchronous session
session = AsyncHTMLSession()

# The only difference we will notice between the regular HTMLSession and the asynchronous one
# is the need to write the keyword 'await' in front of some statements

In [3]:
# In this example we're going to use Reddit's homepage: https://www.reddit.com/
# Several of the links on this page, as well as other elements, are generated by JavaScript
# We will compare the result of scraping those before and after running the JavaScript code

In [4]:
# Since we used async session, we need to use the keyword 'await'
# If you use the regular HTMLSession, there is no need for 'await'
r = await session.get("https://www.reddit.com/")
r.status_code

200

In [5]:
# So far, nothing different from our previous example has happened
# The JavaScript code has not yet been executed

In [6]:
# Here are some tags obtained before rendering the JavaScript code, i.e. extracted from the raw HTML
divs = r.html.find("div")
links = r.html.find("a")
urls = r.html.absolute_links

In [7]:
# Now, we need to execute the JavaScript code that will generate additional tags

In [8]:
# The requests-html package provides a very simple interface for that - just use the 'render()' method
# ('arender()' when using async session)
# It runs the JavaScript code, which updates the HTML. This may take a while
# The updated HTML is stored in the same 'r.html' object - you do not need to assign the result to a new variable
# As before, the 'await' keyword is supplied only because of the Async session
await r.html.arender()
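
The 'arender()' call also accepts optional arguments such as 'scrolldown' and 'sleep', which are described in the docstring printed at the end of this notebook. The sketch below wraps the call in a small helper. The helper name `render_with_scroll` and its default values are assumptions chosen for illustration and are not part of requests-html.

```python
async def render_with_scroll(html, pages=3, pause=1):
    """Render JavaScript while paging down, then count the links found.

    `html` is expected to be the `r.html` object of a response obtained
    through AsyncHTMLSession. The helper name and defaults are hypothetical.
    """
    # 'scrolldown' and 'sleep' are documented parameters of render()/arender()
    await html.arender(scrolldown=pages, sleep=pause)
    # The rendered HTML replaces the old content in place, so we query the same object
    return len(html.find("a"))
```

In a notebook cell this could be called as `n_links = await render_with_scroll(r.html)`. Pages that load more content while scrolling may then expose more tags than a plain 'arender()' call.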

In [9]:
# NOTE: The first time you run 'render()' or 'arender()', Chromium will be downloaded and installed on your computer

In [10]:
# Now the HTML is updated and we can search for the same tags again
new_divs = r.html.find("div")
new_links = r.html.find("a")
new_urls = r.html.absolute_links

In [11]:
# We can see the difference in the number of found elements before and after the JavaScript executed

In [12]:
len(divs), len(new_divs)

(543, 1728)

In [13]:
len(links), len(new_links)

(87, 681)

In [14]:
len(urls), len(new_urls)

(58, 640)
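
The three comparisons above follow the same pattern, so they could be folded into one small helper. This is only a sketch: the function name `count_growth` is hypothetical, and the lists are placeholders standing in for the results of 'r.html.find(...)'.

```python
def count_growth(before, after):
    """Return (count before, count after, number added) for two collections."""
    return len(before), len(after), len(after) - len(before)


# Placeholder lists standing in for elements found before and after rendering
old_items = ["div"] * 3
new_items = ["div"] * 8

print(count_growth(old_items, new_items))  # (3, 8, 5)
```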

In [15]:
# Remember that 'urls' is a set, and not a list?
# Well, there is a useful feature of sets that we will now take advantage of
# It takes two sets and selects only those items from the first set that are not present in the second one
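
A minimal sketch of that set feature on toy data may help before applying it to the real results. The example.com URLs are made up for illustration.

```python
# Toy stand-ins for the URL sets collected before and after rendering
urls_before = {"https://example.com/", "https://example.com/about"}
urls_after = {
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/news",
}

# difference() keeps the items of the calling set that are absent from the argument
only_new = urls_after.difference(urls_before)

# For two sets, the '-' operator gives the same result
assert only_new == urls_after - urls_before

print(only_new)  # {'https://example.com/news'}
```

Note that the order matters: `urls_before.difference(urls_after)` is empty here, because every old URL is still present after rendering.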

In [16]:
# Keep only the URLs that appeared after rendering, i.e. those in 'new_urls' but not in 'urls'
new_urls.difference(urls)

{'https://i.imgur.com/nMhodgS.gifv',
 'https://www.reddit.com/r/1200isplenty/',
 'https://www.reddit.com/r/2007scape/',
 'https://www.reddit.com/r/49ers/',
 'https://www.reddit.com/r/90DayFiance/',
 'https://www.reddit.com/r/ACMilan/',
 'https://www.reddit.com/r/Adelaide/',
 'https://www.reddit.com/r/Amd/',
 'https://www.reddit.com/r/Android/',
 'https://www.reddit.com/r/Animesuggest/',
 'https://www.reddit.com/r/AnthemTheGame/',
 'https://www.reddit.com/r/AskCulinary/',
 'https://www.reddit.com/r/AskMen/',
 'https://www.reddit.com/r/AskNYC/',
 'https://www.reddit.com/r/AskReddit/',
 'https://www.reddit.com/r/AskWomen/',
 'https://www.reddit.com/r/Astros/',
 'https://www.reddit.com/r/Atlanta/',
 'https://www.reddit.com/r/AtlantaUnited/',
 'https://www.reddit.com/r/Augusta/',
 'https://www.reddit.com/r/Austria/',
 'https://www.reddit.com/r/Barca/',
 'https://www.reddit.com/r/BattlefieldV/',
 'https://www.reddit.com/r/BeautyBoxes/',
 'https://www.reddit.com/r/BeautyGuruChatter/',
 'https

In [17]:
# Finally, close the session
# In the async session 'close()' is a coroutine, so it must be awaited;
# without 'await' the call only returns a coroutine object and nothing is closed
await session.close()
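
Because 'close()' is a coroutine method in the async session, calling it without 'await' only creates a coroutine object and does not run it. The sketch below shows the same behaviour with a hypothetical stand-in function, `close_resource`, so that it runs without a network connection. It is written as a standalone script. Inside Jupyter, `asyncio.run(main())` would be replaced by `await main()`.

```python
import asyncio
import inspect


async def close_resource():
    """Hypothetical stand-in for an async method such as the session's close()."""
    return "closed"


async def main():
    # Calling the coroutine function without 'await' only creates a coroutine object
    pending = close_resource()
    created_only = inspect.iscoroutine(pending)
    # Awaiting the object is what actually runs the function body
    result = await pending
    return created_only, result


created_only, result = asyncio.run(main())
print(created_only, result)  # True closed
```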

In [18]:
# You can check the documentation directly inside Jupyter
print(r.html.render.__doc__)

Reloads the response in Chromium, and replaces HTML content
        with an updated version, with JavaScript executed.

        :param retries: The number of times to retry loading the page in Chromium.
        :param script: JavaScript to execute upon page load (optional).
        :param wait: The number of seconds to wait before loading the page, preventing timeouts (optional).
        :param scrolldown: Integer, if provided, of how many times to page down.
        :param sleep: Integer, if provided, of how many long to sleep after initial render.
        :param reload: If ``False``, content will not be loaded from the browser, but will be provided from memory.
        :param keep_page: If ``True`` will allow you to interact with the browser page through ``r.html.page``.

        If ``scrolldown`` is specified, the page will scrolldown the specified
        number of times, after sleeping the specified amount of time
        (e.g. ``scrolldown=10, sleep=1``).

        If just ``sleep