<center><font size='5'>Scraping the World Bank website</font></center>
<center><font size='3'>Eric Martin, CSE, UNSW</font></center>
<center><font size='3'>COMP9021 Principles of Programming</font></center>

In [None]:
# Does not need to be executed if
# ~/.ipython/profile_default/ipython_config.py
# exists and contains:
# c.InteractiveShell.ast_node_interactivity = 'all'

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [None]:
import os.path
import pandas as pd
import urllib.error
import urllib.request

import bs4
import openpyxl

From the World Bank home page, http://www.worldbank.org/en/where-we-work, clicking on a country name, _India_ for instance, might take us to an overview page, https://www.worldbank.org/en/country/india in that example. That page might display a box titled _Data_ with a link to a Data page, https://data.worldbank.org/country/india in that example, which in turn might display a number of boxes, one of which might be titled _IBRD/IDA Operations Approved by Fiscal Year_ and indicate an amount in US dollars, _$859.50 million_ in that example. It is not possible to get that amount for all countries listed on the home page through that sequence of two clicks. For instance:

* Clicking on Bahrain triggers the Page Not Found 404 Error.
* Clicking on Czech Republic takes us to a page for the European Union, not for the Czech Republic.
* Clicking on Israel takes us directly to the Data page.
* Clicking on Costa Rica takes us to a page where clicking on Data triggers the Page Not Found 404 Error.
* Clicking on Colombia takes us to a page where we can click on Data, which takes us to a page that has no box titled _IBRD/IDA Operations Approved by Fiscal Year_.


We want to restrict ourselves to countries for which the desired amount can be accessed via the described sequence of 2 clicks, and create a spreadsheet to record countries and amounts, the latter as integers; with India as example, the integer is 859500000.

Using the `urlopen()` function from the `urllib.request` module, we open the World Bank home page and pass the object it returns as first argument to the `BeautifulSoup` class from the `beautifulsoup4` package, imported as `bs4` in code. Provided with `'html.parser'` as second argument, `BeautifulSoup()` creates an object with methods to parse the html code of the page. We place the call to `urlopen()` within a `try ... except ...` statement to catch two kinds of errors, `urllib.error.HTTPError` or `urllib.error.URLError`, that might be raised when an attempt to access the resource fails:

In [None]:
try:
    with urllib.request.urlopen(
                           'http://www.worldbank.org/en/where-we-work'
                               ) as top_url:
        top_page = bs4.BeautifulSoup(top_url, 'html.parser')
except (urllib.error.HTTPError, urllib.error.URLError):
    print('Could not access the top resource.')
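
As a side note on the two exception types: in the standard library, `HTTPError` is a subclass of `URLError`, so catching `URLError` alone would also catch HTTP errors. The following minimal sketch, which needs no network access, builds both kinds of exception by hand to show how they can be told apart; the `describe()` helper and the example URL are made up for illustration.

```python
import urllib.error


def describe(error):
    # HTTPError has to be tested first, as it is a subclass of URLError.
    if isinstance(error, urllib.error.HTTPError):
        return f'HTTP error {error.code}'
    if isinstance(error, urllib.error.URLError):
        return f'URL error: {error.reason}'
    return 'Other error'


# Exceptions built by hand, only to exercise describe().
not_found = urllib.error.HTTPError('http://example.org/missing', 404,
                                   'Not Found', None, None
                                  )
unreachable = urllib.error.URLError('no route to host')

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
print(describe(not_found))
print(describe(unreachable))
```

Listing both types in the `except` clause, as done above, is therefore redundant but harmless, and arguably documents the intent.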

Searching for `www.worldbank.org/en/country/india` in the html code, we find:

In [None]:
<a class="alpha-name dropdown-item"
    href="https://www.worldbank.org/en/country/india">
    India
</a>

We use the `select()` method of the `BeautifulSoup` class to get the list of tag objects for the anchor tags in the html code whose class attribute has `alpha-name dropdown-item` as value, passing `'a[class="alpha-name dropdown-item"]'` as argument to `select()`. The country name associated with each such tag can be retrieved with the object's `text` attribute, while the object's `get()` method, with `'href'` as argument, returns the value of the href attribute of the anchor tag in the html code. Besides printing out each full tag together with the extracted url and country name, we just try and access the urls, again catching errors of type `urllib.error.HTTPError` or `urllib.error.URLError`:

In [None]:
try:
    with urllib.request.urlopen(
                            'http://www.worldbank.org/en/where-we-work'
                               ) as top_url:
        top_page = bs4.BeautifulSoup(top_url, 'html.parser')
        for country in top_page.select(
                                  'a[class="alpha-name dropdown-item"]'
                                      ):
            country_url = country.get('href')
            country_name = country.text
            print(country)
            print('   ', country_url)
            print('   ', country_name)
            try:
                with urllib.request.urlopen(country_url)\
                                                       as overview_url:
                    pass
            except (urllib.error.HTTPError, urllib.error.URLError):
                print('Could not access overview page for '
                      f'{country_name}.'
                     )
except (urllib.error.HTTPError, urllib.error.URLError):
    print('Could not access the top resource.')
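
The selection step can be tried offline on a small piece of html. The sketch below parses a string modelled on the anchor tag quoted earlier, plus one unrelated anchor tag whose url is made up, and extracts name and url from the matching tag. It also shows the class selector `a.alpha-name.dropdown-item`, which is an alternative not used in this notebook: it matches tags that have both classes, possibly among others, whereas the attribute selector requires the class attribute to be exactly that string.

```python
import bs4

html = '''
<a class="alpha-name dropdown-item"
    href="https://www.worldbank.org/en/country/india">
    India
</a>
<a class="other-link" href="https://example.org/about">About</a>
'''
page = bs4.BeautifulSoup(html, 'html.parser')

# Attribute selector, as used in the notebook.
countries = [(tag.text.strip(), tag.get('href'))
             for tag in page.select('a[class="alpha-name dropdown-item"]')
            ]
print(countries)

# Class selector: an alternative that ignores order and extra classes.
by_class = page.select('a.alpha-name.dropdown-item')
print(len(by_class))
```

Note the call to `strip()`: when the html code puts the country name on its own line, the `text` attribute includes the surrounding whitespace.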

All urls can be accessed except for that of Eswatini: the `Could not access overview page for Eswatini.` error message is output. Checking out the associated anchor tag, we see it is

In [None]:
<a class="alpha-name dropdown-item"
   href="https://https://www.worldbank.org/en/country/eswatini">Eswatini
</a>

It mistakenly contains one extra `https://`.
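
Should we want to access that page anyway, a possible workaround is to collapse the repeated prefix before calling `urlopen()`. This is only a sketch, assuming the sole defect is a repeated scheme at the start of the url; the `repair_url()` helper is hypothetical and is not used in the rest of this notebook.

```python
import re


def repair_url(url):
    # Replace a run of one or more leading 'http://' or 'https://'
    # prefixes by the last prefix of the run.
    return re.sub(r'^(https?://)+', r'\1', url)


print(repair_url('https://https://www.worldbank.org/en/country/eswatini'))
print(repair_url('https://www.worldbank.org/en/country/india'))
```

A well-formed url is returned unchanged, so the helper could be applied to every extracted url.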

Let's open the overview page for India and create a `BeautifulSoup` object again:

In [None]:
try:
    with urllib.request.urlopen(
                            'http://www.worldbank.org/en/country/india'
                               ) as india_url:
        india_page = bs4.BeautifulSoup(india_url, 'html.parser')
except (urllib.error.HTTPError, urllib.error.URLError):
    print('Could not access overview page for India.')    

Searching for `data.worldbank.org/country/india` in the html code, we see that

In [None]:
<a href="https://data.worldbank.org/country/india"
   class="lp__card_link">Data
</a>

is what is of interest to access the Data page. There is another anchor tag in the html code with a class attribute whose value is `"lp__card_link"`:

In [None]:
<a href="https://projects.worldbank.org/en/
              projects-operations/projects-summary?countrycode_exact=IN"
   class="lp__card_link">View All Projects
</a>

We could loop over the tag objects returned by `select()` provided with `'a[class="lp__card_link"]'` as argument, and check whether the `text` attribute of those objects evaluates to `'Data'`. A closer examination of the html code for the countries for which that check always fails reveals that the text is sometimes `'DATA'` rather than `'Data'`, so we go for that loop but check the lowercase version of the text against `'data'`. When the check succeeds, we break out of the loop (expected to be executed at most twice, but that is irrelevant), after which further code remains to be written to access the Data page for the country under consideration and try and extract the amount of interest. When the check fails for all tags, we print out a message indicating that no Data page has been found for the country under consideration, before going back to the outermost loop, which ranges over all countries identified from the World Bank home page using the code previously written. That is achieved thanks to a `continue` statement within the body of an `else` clause associated with the innermost `for` statement.

More generally, with a `for ... else ...` statement, either the loop is exited via a `break` statement, in which case the `else` clause is skipped, or it is exited "normally" when the underlying iterator is exhausted (a `StopIteration` exception is raised and handled in the background), in which case the body of the `else` clause is executed:

In [None]:
def f(n):
    for i in range(n):
        print('Processing', i)
        if i > 2:
            print('Breaking out!')
            break
    else:
        print('All done!')
    print('What comes next?')
    
f(2)
print()
f(6)
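
The combination of an inner `for ... else ...` statement with a `continue` that applies to an outer loop can also be tried on plain data. In the sketch below, the dictionary and the function name are made up: each key stands for a country and each value for the texts of the candidate links found on its overview page.

```python
# Hypothetical link texts, one list per (unnamed) country.
links = {'Country A': ['View All Projects', 'Data'],
         'Country B': ['View All Projects'],
         'Country C': ['DATA']
        }


def countries_with_data_link(links):
    found = []
    for country, texts in links.items():
        for text in texts:
            if text.strip().lower() == 'data':
                break
        else:
            # The inner loop ended without break: no Data link.
            # continue applies to the outer loop, the only loop
            # still active at this point.
            continue
        # Reached only when the inner loop was exited via break.
        found.append(country)
    return found


print(countries_with_data_link(links))
```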

In [None]:
try:
    with urllib.request.urlopen(
        'http://www.worldbank.org/en/where-we-work'
                               ) as top_url:
        top_page = bs4.BeautifulSoup(top_url, 'html.parser')
        for country in top_page.select(
                                  'a[class="alpha-name dropdown-item"]'
                                      ):
            country_name = country.text
            try:
                with urllib.request.urlopen(country.get('href'))\
                        as overview_url:
                    overview_page = bs4.BeautifulSoup(overview_url,
                                                      'html.parser'
                                                     )
                    for data in overview_page.select(
                                             'a[class="lp__card_link"]'
                                                    ):
                        if data.text.strip().lower() == 'data':
                            print('Found data page for '
                                  f'{country_name}.'
                                 )
                            break
                    else:
                        print(f'No Data page for {country_name}.')
                        continue
            except (urllib.error.HTTPError, urllib.error.URLError):
                print('Could not access overview page for '
                      f'{country_name}.'
                     )
except (urllib.error.HTTPError, urllib.error.URLError):
    print('Could not access the top resource.')

Let's open the Data page for India and create a `BeautifulSoup` object again:

In [None]:
try:
    with urllib.request.urlopen(
                             'http://data.worldbank.org/country/india'
                               ) as india_data_url:
        india_data_page = bs4.BeautifulSoup(india_data_url,
                                            'html.parser'
                                           )
except (urllib.error.HTTPError, urllib.error.URLError):
    print('Could not access Data page for India.')

Searching for IBRD/IDA Operations Approved by Fiscal Year in the html code, we find:

In [None]:
<span class="name" data-reactid="297">
          IBRD/IDA Operations Approved by Fiscal Year
</span>
<div class="chart" data-reactid="298">
    <div class="chart-summry" data-reactid="299">
        <div data-reactid="300">
            <em data-reactid="301">$859.50 million</em>
            ...
        </div>
    </div>
</div>

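`BeautifulSoup`'s `find()` method returns the PageElement object for the first tag in the html code whose name is provided as argument, or `None` if there is no such tag, the value of the `text` argument to `find()` further restricting the search (recent versions of Beautiful Soup name that argument `string` and keep `text` as a deprecated alias):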

In [None]:
indicator = 'IBRD/IDA Operations Approved by Fiscal Year'
india_data_page.find('span', text=indicator)

We assume, based on an examination of the html code, that the previous use of `find()` returns a PageElement object that corresponds to the beginning of the quoted code snippet, not to the beginning of another code snippet somewhere else in the html code. What is of interest in the code snippet, namely, \$859.50 million, occurs within the tag that follows the "span" tag. A PageElement for that tag can be obtained thanks to the `next_sibling` attribute:

In [None]:
india_data_page.find('span', text=indicator).next_sibling

That tag, `div class="chart" data-reactid="298"`, has a first (possibly unique) child, `div class="chart-summry" data-reactid="299"`, which itself has a first (possibly unique) child, `div data-reactid="300"`, which itself has a first (possibly unique) child, `em data-reactid="301"`. The `children` attribute evaluates to an iterator, which combined with a call to `next()`, gives access to the first child of the tag under consideration:

In [None]:
next(india_data_page.find('span', text=indicator
                         ).next_sibling.children
    )

In [None]:
next(next(india_data_page.find('span', text=indicator
                              ).next_sibling.children
         ).children
    )

In [None]:
next(next(next(india_data_page.find('span', text=indicator
                                   ).next_sibling.children
              ).children).children
    )

Accessing the dollar amount is now just a matter of using the `text` attribute:

In [None]:
next(next(next(india_data_page.find('span', text=indicator
                                   ).next_sibling.children
              ).children
         ).children
    ).text
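
The chain of `next_sibling` and `children` depends on there being no whitespace between the tags involved: when there is some, `next_sibling` evaluates to that whitespace rather than to the following tag. The sketch below illustrates this on a hand-written snippet modelled on the one quoted above, with a newline inserted after the "span" tag. It then retrieves the amount with `find_next_sibling()` and `find()`, an alternative that skips text nodes; it is not the approach used in the rest of this notebook.

```python
import bs4

indicator = 'IBRD/IDA Operations Approved by Fiscal Year'
html = ('<section>'
        f'<span class="name">{indicator}</span>\n'
        '<div class="chart"><div class="chart-summry"><div>'
        '<em>$859.50 million</em>'
        '</div></div></div>'
        '</section>'
       )
page = bs4.BeautifulSoup(html, 'html.parser')
span = page.find('span', string=indicator)

# Because of the newline, the next sibling is a string, not the div tag.
print(repr(span.next_sibling))

# find_next_sibling() only considers tags; find() then searches the
# descendants of the div tag for the first em tag.
amount = span.find_next_sibling('div').find('em').text
print(amount)
```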

Before completing the code and extracting all amounts for all countries, provided the amount exists and our methodology makes the extraction possible, let us write a function to convert a string such as `'$859.50 million'` into an integer. Examination of the html code reveals that amounts can be expressed in millions or billions of dollars. The code that follows handles more units than that, and is general enough for its intended purpose:

In [None]:
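def convert_to_number(amount):
    units = {'thousand': 10 ** 3, 'million': 10 ** 6,
             'billion': 10 ** 9, 'trillion': 10 ** 12
            }
    for unit in units:
        if unit in amount:
            # Remove the dollar sign and the unit; rstrip(unit) would
            # strip a set of characters rather than a suffix.
            value = amount.strip().lstrip('$').replace(unit, '')
            # A float product can fall just below the intended integer
            # (2.05 * 10 ** 9 does), so round rather than truncate.
            return round(float(value) * units[unit])
    # None is implicitly returned if no known unit occurs in amount.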

In [None]:
convert_to_number('$859.50 million')
convert_to_number('$2.05 billion')
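
Going through `float` means that the product is computed with floating point arithmetic and is not always exactly the intended integer. A stricter variant, sketched here with the standard `decimal` and `re` modules, performs the multiplication exactly and raises `ValueError` on input it does not recognise. The name `parse_amount()` and the accepted format (a dollar sign, a plain decimal number, a unit) are assumptions made for this illustration.

```python
import re
from decimal import Decimal

UNITS = {'thousand': 10 ** 3, 'million': 10 ** 6,
         'billion': 10 ** 9, 'trillion': 10 ** 12
        }


def parse_amount(amount):
    # Expected shape: '$', digits with an optional fractional part,
    # then a single word for the unit.
    match = re.fullmatch(r'\s*\$\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*',
                         amount
                        )
    if match is None or match.group(2).lower() not in UNITS:
        raise ValueError(f'Unrecognised amount: {amount!r}')
    number, unit = match.group(1), match.group(2).lower()
    # Decimal arithmetic is exact for these values.
    return int(Decimal(number) * UNITS[unit])


print(parse_amount('$859.50 million'))
print(parse_amount('$2.05 billion'))
```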

Getting back to extracting all dollar amounts, we slightly adapt and extend the code we wrote and implement a generator function, meant to provide on demand country name and dollar amount, for every country listed in the World Bank home page. We use another `try ... except ...` statement to generate an error message for all countries for which we can access a Data page but fail to find our indicator, or find it but within html code that is not structured as we expect it to be: 

In [None]:
def countries_and_data():
    for country in top_page.select(
                                  'a[class="alpha-name dropdown-item"]'
                                  ):
        country_name = country.text
        try:
            with urllib.request.urlopen(country.get('href'))\
                                                       as overview_url:
                overview_page = bs4.BeautifulSoup(overview_url,
                                                  'html.parser'
                                                 )
                for data in overview_page.select(
                                           'a[class="lp__card_link"]'
                                                ):
                    if data.text.strip().lower() == 'data':
                        break
                else:
                    print(f'No Data page for {country_name}.')
                    continue
                try:
                    with urllib.request.urlopen(data.get('href'))\
                                                           as data_url:
                        data_page = bs4.BeautifulSoup(data_url,
                                                      'html.parser'
                                                     )
                        try:
                            yield country_name,\
                                  convert_to_number(next(
                                                     next(
                                                      next(
                                   data_page.find('span',
                                                  text=indicator
                                                 ).next_sibling.children
                                                          ).children
                                                         ).children
                                                        ).string
                                                    )
                        except (AttributeError, StopIteration):
                            print(f'No {indicator} for {country_name}.')
                except (urllib.error.HTTPError, urllib.error.URLError):
                    print('Could not access Data page for '
                          f'{country_name}.'
                         )
        except (urllib.error.HTTPError, urllib.error.URLError):
            print(f'Could not access overview page for {country_name}.')
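
One subtlety when calling `next()` within a generator function: if the iterator is empty, `next()` raises `StopIteration`, which is not an `AttributeError`, and if that exception escapes the body of the generator function, Python 3.7 and later convert it into a `RuntimeError`. The sketch below, on made-up data and with hypothetical function names, contrasts a generator function that lets the exception escape with one that handles it.

```python
def unsafe_firsts(iterables):
    for iterable in iterables:
        # StopIteration escapes if iterable is empty.
        yield next(iter(iterable))


def safe_firsts(iterables):
    for iterable in iterables:
        try:
            first = next(iter(iterable))
        except StopIteration:
            # Nothing to yield for an empty iterable; move on.
            continue
        yield first


data = [[1, 2], [], [3]]
print(list(safe_firsts(data)))
try:
    list(unsafe_firsts(data))
except RuntimeError as error:
    print('RuntimeError:', error)
```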

We let `enumerate()` get from `countries_and_data()` all pairs it can yield. Thanks to the `Workbook` class from the `openpyxl` module, we create an object, referred to as `workbook`, whose `active` attribute returns an object, referred to as `spreadsheet`, endowed with attributes to write into a spreadsheet:

* Thanks to the `title` attribute, we set the title to `World countries`.
* Assigning to `spreadsheet['A1']` and `spreadsheet['B1']`, we write on the first row of the first two columns the column headers, namely, `Country` and `IBRD/IDA operations`, respectively.
* With `spreadsheet.cell(row=counter, column=1).value` and `spreadsheet.cell(row=counter, column=2).value` we write on row number (the value of) `counter`, country name and amount, respectively, as yielded by the call to `enumerate()`; we pass 2 as second argument to the latter to start writing from the second row onward.
* Eventually, we save the spreadsheet as a file whose name is provided as argument to `workbook`'s `save()` method.
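
The steps listed above, except for saving to a file, can be tried in memory with a stand-in generator function. In the sketch below, `sample_pairs()` and the values it yields are made up; `iter_rows()` with `values_only=True` is used only to read the cells back as tuples.

```python
import openpyxl


def sample_pairs():
    # Made-up values standing in for what countries_and_data() yields.
    yield 'Country A', 859500000
    yield 'Country B', 2050000000


workbook = openpyxl.Workbook()
spreadsheet = workbook.active
spreadsheet.title = 'World countries'
spreadsheet['A1'] = 'Country'
spreadsheet['B1'] = 'IBRD/IDA operations'
# Row 1 holds the headers, so numbering of data rows starts at 2.
for row, (country, amount) in enumerate(sample_pairs(), 2):
    spreadsheet.cell(row=row, column=1).value = country
    spreadsheet.cell(row=row, column=2).value = amount

rows = list(spreadsheet.iter_rows(values_only=True))
print(rows)
```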

All that code is embedded in the code we wrote at the beginning to access the World Bank home page. The code in the following cell takes a few minutes to complete execution, and outputs an error message for each country for which we fail to extract the sought-after amount.

In [None]:
try:
    with urllib.request.urlopen(
                            'http://www.worldbank.org/en/where-we-work'
                               ) as top_url:
        top_page = bs4.BeautifulSoup(top_url, 'html.parser')
        workbook = openpyxl.Workbook()
        spreadsheet = workbook.active
        spreadsheet.title = 'World countries'
        spreadsheet['A1'] = 'Country'
        spreadsheet['B1'] = 'IBRD/IDA operations'
        for counter, (country, amount) in\
                                   enumerate(countries_and_data(), 2):
            spreadsheet.cell(row=counter, column=1).value = country
            spreadsheet.cell(row=counter, column=2).value = amount
        workbook.save('IBRD_IDA_operations.xlsx')
except (urllib.error.HTTPError, urllib.error.URLError):
    print('Could not access the top resource.')

At this stage, the file `IBRD_IDA_operations.xlsx` should have been generated and its contents read as follows:

In [None]:
pd.read_excel('IBRD_IDA_operations.xlsx', engine='openpyxl')