# Advanced Web Scraping and Data Gathering

Complete the tasks listed below. You can submit the completed lab until 11:59 PM.

<u>Requirement:</u><br>
Do your best to write Pythonic code instead of traditional programming code.

### Task 1 (0.5 mark)

Import the libraries you need to create a soup with bs4 and, later, to extract a table from it as a DataFrame. Load the data from the file __List of countries by GDP (nominal) - Wikipedia.htm__. You can close the file handle as soon as you have created the soup, since you won't need it afterwards.

<u>Hint</u>: You need to figure out the correct encoding to read this file; otherwise, reading it raises an error.

In [12]:
### Write your code below this comment.
from bs4 import BeautifulSoup
import pandas as pd

# The context manager closes the file handle as soon as the soup is built.
with open('data/List of countries by GDP (nominal) - Wikipedia.htm', 'r', encoding='utf-8') as fhtml:
  soup = BeautifulSoup(fhtml)
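
The encoding hint can be checked on a small file. The sketch below is only an illustration under stated assumptions: the file name `sample.htm` and its content are hypothetical stand-ins for the saved Wikipedia page. It shows that decoding non-ASCII text with the wrong codec raises `UnicodeDecodeError`, while the codec the file was written in succeeds.

```python
import os
import tempfile

# Hypothetical sample content; the real lab file is the saved Wikipedia page.
sample = "<p>GDP (US$MM) ♠ Côte d'Ivoire</p>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.htm")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Decoding with the wrong codec fails on the non-ASCII characters.
    try:
        with open(path, "r", encoding="ascii") as f:
            f.read()
        ascii_ok = True
    except UnicodeDecodeError:
        ascii_ok = False

    # Decoding with the codec the file was written in succeeds.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()

print(ascii_ok, text == sample)
```

Passing `encoding` explicitly also keeps the result independent of the platform's default codec.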

### Task 2 (0.5 mark)

Write code to report the total number of tables present on the web page.

In [13]:
### Write your code below this comment.
tables = soup.find_all('table')
print("The total number of tables present on the web page is {}.".format(len(tables)))

The total number of tables present on the web page is 9.


### Task 3 (1 mark)

Find the table with data about countries and their GDP and store it in a variable named __data_table__. Also print its type.

<u>Hint</u>: The method you use to find the table accepts a dictionary of attributes as an additional argument. Use __class__ as the key, and determine its value by examining the web page.

In [14]:
### Write your code below this comment.
data_table = soup.find('table', {'class':'"wikitable"|}'})
print("The type of the table is {}.".format(type(data_table)))

The type of the table is <class 'bs4.element.Tag'>.
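
One way to work out the value for the __class__ key is to list the `class` attribute of every table in the soup. The following is a minimal sketch on a hypothetical miniature page; the HTML and its class names are made up for illustration and are not taken from the lab file.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page; the class names are made up for illustration.
html = """
<table class="infobox"><tr><td>a</td></tr></table>
<table class="wikitable sortable"><tr><td>b</td></tr></table>
<table><tr><td>c</td></tr></table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# bs4 returns class as a list of names; Tag.get gives None when it is absent.
table_classes = [table.get("class") for table in demo_soup.find_all("table")]
print(table_classes)

# A dictionary of attributes narrows the search to a matching table.
found = demo_soup.find("table", {"class": "infobox"})
print(found.td.get_text())
```

On the real page, the same listing shows which class value singles out the GDP table.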


### Task 4 (2 marks)

Figure out how many captions the data table has. The captions include the sources `Per the International Monetary Fund (2017)[1]`, `Per the World Bank (2017)[20]`, and `Per the United Nations (2016)[21][22]`. Store the table elements containing the captions in a list named __sources_list__. Also report the number of elements in the list.

Then go ahead and extract the GDP data tables present inside the main data table. Store the GDP data tables in a list named __data_tables__ for later use. Also report the number of tables in this list.

<u>Hint</u>: The table you found in Task 3 contains the data about countries and their GDP as three separate tables nested within it.

In [15]:
### Write your code below this comment.
table_rows = data_table.find_all('tr', limit=2)
sources_list = table_rows[0].find_all('td')
print("The table has {} captions.".format(len(sources_list)))

The table has 3 captions.


In [16]:
# Skip the newline strings between the cells; keep the elements that hold the GDP tables.
data_tables = [child for child in table_rows[1].children if child != '\n']
print("The list has {} tables.".format(len(data_tables)))

The list has 3 tables.
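
The layout described in the hint, an outer table whose first row holds the captions and whose second row holds one nested table per cell, can be reproduced in miniature. The sketch below uses hypothetical HTML and filters the row's children with `isinstance(child, Tag)`, which is an assumed alternative to comparing against `'\n'`: it skips any whitespace string, not only a single newline.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical miniature of the nested layout; not the lab file.
html = """<table><tr>
<td>Caption A</td><td>Caption B</td>
</tr><tr>
<td><table><tr><td>1</td></tr></table></td>
<td><table><tr><td>2</td></tr></table></td>
</tr></table>"""
outer = BeautifulSoup(html, "html.parser").find("table")

# The first two rows in document order are the caption row and the row of tables.
caption_row, tables_row = outer.find_all("tr", limit=2)
captions = caption_row.find_all("td")

# .children also yields the strings between tags, so keep only real tags.
cells = [child for child in tables_row.children if isinstance(child, Tag)]
nested_values = [cell.find("table").td.get_text() for cell in cells]
print(len(captions), len(cells), nested_values)
```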


### Task 5 (1 mark)

Now go ahead and extract the names of source organizations (`'International Monetary Fund'`, `'World Bank'`, and `'United Nations'`) from the __sources_list__ you created in Task 4 above.

In [17]:
### Write your code below this comment.
# The first link in each caption cell carries the organization name in its title attribute.
organizations = [element.find('a')['title'] for element in sources_list]
organizations

['International Monetary Fund', 'World Bank', 'United Nations']

### Task 6 (2 marks)

Using the __data_tables__ list from Task 4 above, separate the header and data for the first source's GDP data table. Then create a DataFrame that looks as follows:

<img align=left src="images/df.png" height="270" width="270">

In [18]:
### Write your code below this comment.
first_table = data_tables[0]

table_headers = first_table.thead.find_all('th')
headers = [th.get_text(strip=True) for th in table_headers]

rows = first_table.tbody.find_all('tr')[1:]
data = [[td.get_text(strip=True) for td in row.find_all('td')] for row in rows]

imf_df = pd.DataFrame(data, columns=headers)
imf_df.head()


| | Rank | Country | GDP(US$MM) |
|---|---|---|---|
| 0 | 1 | United States | 19390600 |
| 1 | 2 | China[n 1] | 12014610 |
| 2 | 3 | Japan | 4872135 |
| 3 | 4 | Germany | 3684816 |
| 4 | 5 | United Kingdom | 2624529 |


### Task 7 (3 marks)

Now do the same for the GDP data tables of the other two sources. This time the task is more complex, because you may see a long unwanted number such as `7007193906040000000` followed by the character `♠` in your resulting DataFrame, as follows:

<img align=left src="images/weird_df.png" height="400" width="400">

Therefore, you need to write a small function named __find_right_text__ that finds these unwanted numbers and `♠` characters and removes them from the data rows of the list you use to create your DataFrame.

<u>Hint</u>: The function __find_right_text__ can take two arguments `(i, td)` to figure out the index of the `<td>` element it receives from the list comprehension. Depending upon the index, the function can use __getText()__ followed by either __strip()__ or __find()__ on top of it. You may also need to use the __enumerate()__ function in your list comprehension to get the desired results.

In [19]:
### Write your code below this comment.
def convert_table_to_df(table):
  """Return a DataFrame built from the header and data rows of a GDP table."""
  headers = [th.get_text(strip=True) for th in table.thead.find_all('th')]

  data = []
  for row in table.tbody.find_all('tr')[1:]:
    tds = row.find_all('td')
    data.append([
      tds[0].get_text(strip=True),
      tds[1].get_text(strip=True),
      # The last string in the GDP cell is the visible value; the unwanted
      # number ending in ♠, when present, comes before it.
      list(tds[2].stripped_strings)[-1]
    ])

  return pd.DataFrame(data, columns=headers)
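
The hint in Task 7 describes a different route from the function above: a __find_right_text__ helper called from a list comprehension with __enumerate()__. The following is a rough sketch of that route, not a reference solution. The sample HTML is hypothetical, and the assumption that the unwanted number sits in the third cell (index 2) is taken from the task description.

```python
from bs4 import BeautifulSoup


def find_right_text(i, td):
    """Return the cleaned text of the i-th cell of a data row."""
    text = td.get_text()  # the same text the hint obtains with getText()
    if i < 2:
        # Rank and country cells only need surrounding whitespace removed.
        return text.strip()
    # Assumed: in the GDP cell, everything up to and including ♠ is unwanted.
    # str.find returns -1 when ♠ is absent, so the slice then keeps the whole text.
    return text[text.find("♠") + 1:].strip()


# Hypothetical two-row sample; the first GDP cell carries the unwanted prefix.
html = """<table><tr>
<td>1</td><td> United States </td>
<td><span>7007193906040000000♠</span>19390604</td>
</tr><tr>
<td>2</td><td>China</td><td>12237700</td>
</tr></table>"""
rows = BeautifulSoup(html, "html.parser").find_all("tr")
data = [[find_right_text(i, td) for i, td in enumerate(row.find_all("td"))]
        for row in rows]
print(data)
```

The resulting `data` list has the same shape as the one passed to `pd.DataFrame` above, so it could be combined with the extracted headers in the same way.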

In [20]:
second_table = data_tables[1]
wb_df = convert_table_to_df(second_table)
wb_df.head()

| | Rank | Country | GDP(US$MM) |
|---|---|---|---|
| 0 | 1.0 | United States | 19390604 |
| 1 | | European Union[23] | 17277698 |
| 2 | 2.0 | China[n 4] | 12237700 |
| 3 | 3.0 | Japan | 4872137 |
| 4 | 4.0 | Germany | 3677439 |


In [21]:
third_table = data_tables[2]
un_df = convert_table_to_df(third_table)
un_df.head()

| | Rank | Country | GDP(US$MM) |
|---|---|---|---|
| 0 | 1 | United States | 18624475 |
| 1 | 2 | China[n 4] | 11218281 |
| 2 | 3 | Japan | 4936211 |
| 3 | 4 | Germany | 3477796 |
| 4 | 5 | United Kingdom | 2647898 |
