# Data Acquisition: Web Scraping

Data acquisition is a crucial set for developing an information retrieval system. As the bulk of data, primarily textual, are available online, we should be familiar with extracting data from a site either using API or scraping. 

The practice in this notebook will ask you to extract data from a wiki page. 
The tasks are similar to what we saw in the lab notebook, but the only difference is you have to extract two different tables into two separate data frames and then merge them.  

**Activity 1:** Scrap the wiki page "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_sector_composition" to extract the content. Create a soup object using Beautiful soup library and save the soup in a variable called wiki_soup  

In [12]:
# Your code for activity 1 goes here..
#---------------------------------------

import requests as r
from bs4 import BeautifulSoup as bs

import pandas as pd
import numpy as np

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_sector_composition'

response = r.get(url)

soup = bs(response.text, 'html')


**Activity 2:** Extract the table "GDP from natural resources" from the soup and print it.

In [11]:
# Your code for activity 2 goes here..
#---------------------------------------

all_tables = soup.find_all('table', class_='wikitable')
GDP = all_tables[3]

print(GDP)

<table class="wikitable sortable">
<tbody><tr>
<th>Country/Economy</th>
<th>Total natural resources<br/> (% of GDP)</th>
<th>Oil<br/> (% of GDP)</th>
<th>Natural gas<br/> (% of GDP)</th>
<th>Coal<br/> (% of GDP)</th>
<th>Mineral<br/> (% of GDP)</th>
<th>Forest<br/> (% of GDP)
</th></tr>
<tr>
<td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Flag_of_Afghanistan_%282013%E2%80%932021%29.svg/23px-Flag_of_Afghanistan_%282013%E2%80%932021%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Flag_of_Afghanistan_%282013%E2%80%932021%29.svg/35px-Flag_of_Afghanistan_%282013%E2%80%932021%29.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Flag_of_Afghanistan_%282013%E2%80%932021%29.svg/45px-Flag_of_Afghanistan_%282013%E2%80%932021%29.svg.png 2x" width="23"/> </span><a href="/wiki/Afghanistan" title="Afghanistan">Afghanistan

**Activity 3:** Create a dataframe called "resources_df" from the extracted table. Use the column names same as the column headings in original wiki table but make them valid. For example, use "country" instead of "country/economy". 

In [15]:
# Your code for activity 3 goes here..
#---------------------------------------

#Generate lists
Country=[]
Total_Nat_Resource=[]
Oil=[]
Natural_Gas=[]
Coal=[]
Mineral=[]
Forest=[]

# skip first iteration as we dont need headers 
for row in GDP.findAll("tr")[1:]: 
#     print(row)
    
    country=row.findAll('td') # To store game year which is in <th> tag
    
    if len(country)>2: #
        Country.append(country[0].find("a").find(text=True))
        Total_Nat_Resource.append(country[0].find(text=True))
        Oil.append(country[1].find(text=True))
        Natural_Gas.append(country[2].find(text=True))
        Coal.append(country[3].find(text=True))
        Mineral.append(country[4].find(text=True))
        Forest.append(country[5].find(text=True))

GDP_df = pd.DataFrame(Country, columns=['Country'])
GDP_df['Total_Nat_Resource'] = Total_Nat_Resource
GDP_df['Oil'] = Oil
GDP_df['Natural_Gas'] = Natural_Gas
GDP_df['Coal'] = Coal
GDP_df['Mineral'] = Mineral
GDP_df['Forest'] = Forest



In [16]:
GDP_df.head()

Unnamed: 0,Country,Total_Nat_Resource,Oil,Natural_Gas,Coal,Mineral,Forest
0,Afghanistan,,2.1,..,..,0,0.0
1,Albania,,5.1,4.6,0,0,0.5
2,Algeria,,26.3,19,7,0,0.3
3,Angola,,46.6,46.3,0.1,..,0.0
4,Antigua and Barbuda,,0.0,..,..,..,0.0


**Activity 4:** Extract the table "gdp per person employed(ppp) (2015) by sector" from wiki page and create a dataframe called "gdp_percent" out of it. Use the column names same as the column headings in original wiki table but make them valid. For example, use "country" instead of "country/economy". 

In [18]:
# Your code for activity 4 goes here..
#---------------------------------------

GDP_PP = all_tables[5]

# print(GDP_PP)

#Generate lists
Country=[]
Agriculture_GDP=[]
Industry_GDP=[]
Services_GDP=[]
Agriculture_Employ=[]
Industry_Employ=[]
Services_Employ=[]

# skip first iteration as we dont need headers 
for row in GDP_PP.findAll("tr")[1:]:
    
    # print (row)
    tds = row.findAll("td")
    if len(tds) > 2:
        
        country = tds[0].findAll("a")
        if len(country) < 1:
            Country.append('World')   # <------------- Handles the 'World' outlier in the table
        else:
            Country.append(country[0].find(text=True))

        Agriculture_GDP.append(tds[1].find(text=True))
        Industry_GDP.append(tds[2].find(text=True))
        Services_GDP.append(tds[3].find(text=True))
        Agriculture_Employ.append(tds[4].find(text=True))
        Industry_Employ.append(tds[5].find(text=True))
        Services_Employ.append(tds[6].find(text=True).strip())


GDP_PP_DF = df=pd.DataFrame({
    'Country':Country,
    'Agriculture_GDP':Agriculture_GDP,
    'Industry_GDP':Industry_GDP,
    'Services_GDP':Services_GDP,
    'Agriculture_Employ':Agriculture_Employ,
    'Industry_Employ':Industry_Employ,
    'Services_Employ':Services_Employ
                               })

GDP_PP_DF.head()


Unnamed: 0,Country,Agriculture_GDP,Industry_GDP,Services_GDP,Agriculture_Employ,Industry_Employ,Services_Employ
0,Afghanistan,21.4 %,22.9 %,55.7 %,61.6 %,9.9 %,28.5 %
1,Albania,22.9 %,24.2 %,53 %,42.3 %,18.1 %,39.6 %
2,United Arab Emirates,0.7 %,44.1 %,55.1 %,3.6 %,21.5 %,74.9 %
3,Argentina,6 %,28.1 %,65.9 %,2.1 %,24.7 %,73.3 %
4,Armenia,19.3 %,28.8 %,52 %,35.3 %,15.9 %,48.8 %


**Activity 5:** Combine the dataframes resources_df and gdp_percent. Name the resultant dataframe as combined_df

In [20]:
# Your code for activity 5 goes here..
#---------------------------------------

combined_df = GDP_df.merge(
                    GDP_PP_DF, 
                    how='inner', 
                    on='Country')

combined_df.head()


Unnamed: 0,Country,Total_Nat_Resource,Oil,Natural_Gas,Coal,Mineral,Forest,Agriculture_GDP,Industry_GDP,Services_GDP,Agriculture_Employ,Industry_Employ,Services_Employ
0,Afghanistan,,2.1,..,..,0,0.0,21.4 %,22.9 %,55.7 %,61.6 %,9.9 %,28.5 %
1,Albania,,5.1,4.6,0,0,0.5,22.9 %,24.2 %,53 %,42.3 %,18.1 %,39.6 %
2,Algeria,,26.3,19,7,0,0.3,12.6 %,38.8 %,48.6 %,11.4 %,35.1 %,53.5 %
3,Argentina,,6.1,4.1,1.2,0,0.8,6 %,28.1 %,65.9 %,2.1 %,24.7 %,73.3 %
4,Armenia,,2.7,..,..,..,2.7,19.3 %,28.8 %,52 %,35.3 %,15.9 %,48.8 %


**Activity 6:** Replace the the invalid values ".." with valid NAN in combined_df

In [21]:
# Your code for activity 6 goes here
#---------------------------------------

combined_df.replace('..', np.NaN, inplace=True)

combined_df.head()




Unnamed: 0,Country,Total_Nat_Resource,Oil,Natural_Gas,Coal,Mineral,Forest,Agriculture_GDP,Industry_GDP,Services_GDP,Agriculture_Employ,Industry_Employ,Services_Employ
0,Afghanistan,,2.1,,,0.0,0.0,21.4 %,22.9 %,55.7 %,61.6 %,9.9 %,28.5 %
1,Albania,,5.1,4.6,0.0,0.0,0.5,22.9 %,24.2 %,53 %,42.3 %,18.1 %,39.6 %
2,Algeria,,26.3,19.0,7.0,0.0,0.3,12.6 %,38.8 %,48.6 %,11.4 %,35.1 %,53.5 %
3,Argentina,,6.1,4.1,1.2,0.0,0.8,6 %,28.1 %,65.9 %,2.1 %,24.7 %,73.3 %
4,Armenia,,2.7,,,,2.7,19.3 %,28.8 %,52 %,35.3 %,15.9 %,48.8 %


**Activity 7:** What do you think about the NAN values in the dataset about how they should be handled. Should the rows with NAN values be deleted or imputed with some statistic like mean, median etc. Give us your thoughts.

# Save your notebook, then `File > Close and Halt`