## Collect and preprocess data

Run the lines below only if you need to install the libraries.

(Press Ctrl + / to uncomment them.)

In [4]:
# time is part of the Python standard library, so it needs no installation
#!pip install requests
#!pip install numpy
#!pip install pandas
#!pip install bs4

# 1. Set-up environment

In [5]:
#Necessary Packages
import time
import requests
import numpy as np
import pandas as pd 
from bs4 import BeautifulSoup

# 2. Collect data using Web API

In this section, you will practice collecting data through a Web API (http://api.worldbank.org/v2/country/all/indicator/SP.POP.TOTL). The data comes from the World Bank and covers demographics and other statistics on population, employment, health, GDP, energy consumption, and more, for all countries in the world from 1960 to 2022.

Use the following selected indicators:
- `SP.POP.TOTL` - Total population
- `SP.POP.TOTL.FE.IN` - Total female population
- `SP.POP.TOTL.MA.IN` - Total male population
- `SP.DYN.CBRT.IN` - Birth rate
- `SP.DYN.CDRT.IN` - Death rate
- `SP.DYN.LE00.MA.IN` - Male life expectancy at birth
- `SP.DYN.LE00.FE.IN` - Female life expectancy at birth
- `SE.PRM.ENRR` - Primary school enrollment rate
- `SE.TER.ENRR` - Tertiary school enrollment rate
- `SE.PRM.CMPT.ZS` - Primary completion rate
- `SE.ADT.1524.LT.ZS` - Literacy rate of people ages 15-24

You are required to collect data for 7 countries and save it to the DataFrame `data_countries`:
- `US` - United States of America
- `IN` - India
- `CN` - China
- `JP` - Japan
- `CA` - Canada
- `GB` - United Kingdom (Great Britain)
- `ZA` - South Africa

To extend the collection to other countries or other indicators, see the API documentation: https://datahelpdesk.worldbank.org/knowledgebase/articles/889392-api-documentation

**Hints**:

- Use the base URL: http://api.worldbank.org/v2/
- In order to collect data for each indicator of each country, you can use the URL: "http://api.worldbank.org/v2/countries/{country_code}/indicators/{indicator_code}"
    + `country_code` and `indicator_code` are provided above.
    + For example, you can use the following URL to get the `Total population` of Japan: http://api.worldbank.org/v2/countries/jp/indicators/SP.POP.TOTL
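
The URL pattern in the hints can also be assembled programmatically. The sketch below is one possible way to do it: the helper name `build_indicator_url` is hypothetical, and the query parameters `format`, `date`, and `per_page` are the ones this notebook passes when it calls the API.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.worldbank.org/v2/"


def build_indicator_url(country_code, indicator_code, **params):
    """Return the request URL for one indicator of one country.

    `build_indicator_url` is a hypothetical helper name; extra keyword
    arguments become query-string parameters.
    """
    url = f"{BASE_URL}countries/{country_code}/indicators/{indicator_code}"
    if params:
        # safe=":" keeps the year range readable, e.g. date=2000:2022
        url += "?" + urlencode(params, safe=":")
    return url


# Total population of Japan, 2000 to 2022, as JSON
example_url = build_indicator_url(
    "jp", "SP.POP.TOTL", format="json", date="2000:2022", per_page=100
)
print(example_url)
```

The resulting string can be passed directly to `requests.get`.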

In [6]:
BASE_URL = 'http://api.worldbank.org/v2/'
COUNTRIES = ["US", "IN", "CN", "JP", "CA", "GB", "ZA"]
INDICATORS = ['SP.POP.TOTL', 
             'SP.POP.TOTL.FE.IN', 
             'SP.POP.TOTL.MA.IN',
             'SP.DYN.CBRT.IN', 
             'SP.DYN.CDRT.IN',
             'SP.DYN.LE00.MA.IN',
             'SP.DYN.LE00.FE.IN',
             'SE.PRM.ENRR',
             'SE.TER.ENRR',
             'SE.PRM.CMPT.ZS',
             'SE.ADT.1524.LT.ZS']

# TODO (optional)
# Any other initializations you need
# Column names: one per indicator, in the same order as INDICATORS, then Year and Country
COLUMNS = ['Total Population', 
           'Female Population',
           'Male Population',
           'Birth Rate',
           'Death Rate',
           'Male life expectancy',
           'Female life expectancy', 
           'School enrollment, primary',
           'School enrollment, tertiary' ,
           'Primary completion rate' ,
           'Literacy rate' ,
           'Year' ,
           'Country']
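
Because `collect_data` looks up column names by position, `COLUMNS` has to list one name per indicator in the same order as `INDICATORS`, followed by `Year` and `Country`. The sketch below makes that pairing explicit and checks it; the names `INDICATOR_TO_COLUMN` and `EXTRA_COLUMNS` are hypothetical and not part of the assignment.

```python
INDICATORS = ['SP.POP.TOTL', 'SP.POP.TOTL.FE.IN', 'SP.POP.TOTL.MA.IN',
              'SP.DYN.CBRT.IN', 'SP.DYN.CDRT.IN', 'SP.DYN.LE00.MA.IN',
              'SP.DYN.LE00.FE.IN', 'SE.PRM.ENRR', 'SE.TER.ENRR',
              'SE.PRM.CMPT.ZS', 'SE.ADT.1524.LT.ZS']

COLUMNS = ['Total Population', 'Female Population', 'Male Population',
           'Birth Rate', 'Death Rate', 'Male life expectancy',
           'Female life expectancy', 'School enrollment, primary',
           'School enrollment, tertiary', 'Primary completion rate',
           'Literacy rate', 'Year', 'Country']

# Pair each indicator code with its column name by position.
# zip stops at the shorter list, so Year and Country are left out here.
INDICATOR_TO_COLUMN = dict(zip(INDICATORS, COLUMNS))

# The names left over after the indicators are the Year and Country columns.
EXTRA_COLUMNS = COLUMNS[len(INDICATORS):]
assert EXTRA_COLUMNS == ['Year', 'Country']

print(INDICATOR_TO_COLUMN['SE.TER.ENRR'])
```

If an indicator is added or reordered later, the assertion fails early instead of silently mislabeling a column.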

In [7]:
def collect_data(countryCode, per_page, start_year, end_year):
    df = pd.DataFrame()  # main DataFrame, one column per indicator

    for i, indicator in enumerate(INDICATORS):
        # Request the indicator as JSON for the given year range
        url = (f"{BASE_URL}countries/{countryCode}/indicators/{indicator}"
               f"?format=json&date={start_year}:{end_year}&per_page={per_page}")
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # stop early on HTTP errors
        data = response.json()

        # data[0] holds paging metadata; data[1] holds the observations
        data_entries = data[1]
        dp = []  # one dict per observation (year)

        for entry in data_entries:
            try:
                value = float(entry['value'])
            except (TypeError, ValueError):
                value = np.nan  # missing observations come back as null
            row = {COLUMNS[i]: value}
            if i == len(INDICATORS) - 1:
                # Take Year and Country from the last indicator only
                row[COLUMNS[i + 1]] = entry['date']
                row[COLUMNS[i + 2]] = entry['countryiso3code']
            dp.append(row)

        # Add the new column(s) on the right (axis=1). Rows are aligned by
        # position, which assumes every indicator returns the same years
        # in the same order.
        df = pd.concat([df, pd.DataFrame(dp)], axis=1)

    return df
#raise NotImplementedError("not implement")
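
`collect_data` places the indicator columns side by side, which relies on every response listing the same years in the same order. A more defensive variant, sketched below under assumptions, keys each value by country and year before combining. The helper name `merge_by_year` is hypothetical, and the sample entries are made-up placeholders that only mimic the `date`, `countryiso3code`, and `value` fields read above; they are not real World Bank figures.

```python
def merge_by_year(entries_by_column):
    """Combine per-indicator entry lists into one record per (country, year).

    `entries_by_column` maps a column name to a list of entries shaped like
    the items of `data[1]` in the API response.
    """
    records = {}
    for column, entries in entries_by_column.items():
        for entry in entries:
            key = (entry["countryiso3code"], entry["date"])
            record = records.setdefault(key, {"Country": key[0], "Year": key[1]})
            value = entry["value"]
            # Missing observations are kept as None
            record[column] = float(value) if value is not None else None
    # Sort by country, then by year (newest first)
    ordered_keys = sorted(records, key=lambda k: (k[0], -int(k[1])))
    return [records[k] for k in ordered_keys]


# Placeholder entries (not real data), deliberately in different orders
sample = {
    "Birth Rate": [
        {"countryiso3code": "JPN", "date": "2001", "value": 2.0},
        {"countryiso3code": "JPN", "date": "2000", "value": 1.0},
    ],
    "Death Rate": [
        {"countryiso3code": "JPN", "date": "2000", "value": None},
        {"countryiso3code": "JPN", "date": "2001", "value": 4.0},
    ],
}
rows = merge_by_year(sample)
print(rows)
```

Passing `rows` to `pd.DataFrame` would give one row per country and year, with missing values kept separate from misaligned ones.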

In [8]:
def Generate_Countries_Dataset(countryCode_list):
    data = pd.DataFrame()
    for countryCode in countryCode_list:
        data = pd.concat([data, collect_data(countryCode = countryCode, per_page = 100, start_year = 2000, end_year = 2022)], axis=0)
    return data
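
With `per_page = 100` and 23 years per request, each response fits on a single page. For longer ranges or larger country selections, the results may be split across pages. The sketch below loops over them; `fetch_all_pages` and the injected `fetch_json` callable are hypothetical names, and it assumes the metadata in `data[0]` exposes a `pages` field and that the API accepts a `page` query parameter, which should be confirmed against the API documentation.

```python
def fetch_all_pages(fetch_json, url):
    """Collect the entries from every page of one indicator request.

    `fetch_json` is any callable that takes a URL and returns the decoded
    JSON response, e.g. `lambda u: requests.get(u, timeout=30).json()`.
    `url` is assumed to already contain a query string (`?format=json...`).
    """
    entries = []
    page = 1
    while True:
        meta, data = fetch_json(f"{url}&page={page}")
        # Guard against an empty second element in the response
        entries.extend(data or [])
        # Assumption: the metadata reports the total page count as `pages`
        if page >= int(meta.get("pages", 1)):
            break
        page += 1
    return entries
```

Injecting `fetch_json` keeps the paging logic testable without network access.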

In [9]:
#TEST
data_countries = Generate_Countries_Dataset(COUNTRIES)
assert data_countries.shape == (161, 13)
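
The expected shape in the assertion follows from the inputs: one row per country per year and one column per entry in `COLUMNS`. This assumes the API returns an entry for every year in the range, including years with no value. A quick check of the arithmetic, using the values from the cells above:

```python
n_countries = 7                       # len(COUNTRIES)
start_year, end_year = 2000, 2022
n_years = end_year - start_year + 1   # both endpoints are included: 23 years
n_columns = 11 + 2                    # 11 indicators plus Year and Country

expected_shape = (n_countries * n_years, n_columns)
print(expected_shape)  # (161, 13)
```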

In [10]:
# Save to a CSV file named countries.csv
#TODO
data_countries.to_csv('countries.csv', index = False)
#raise NotImplementedError("not implement")