# Using the World Bank APIs & the Feather Format
Adapted from: https://medium.com/mlearning-ai/6-data-engineering-extract-data-using-apis-1870a1adbd3b

### APIs

Instead of downloading World Bank data via a csv file, you're going to download the data using the World Bank APIs. The purpose of this exercise is to gain experience with another way of extracting data.

API is an acronym that stands for <B>application programming interface.</B> APIs provide a standardized way for two applications to talk to each other. In this case, the applications communicating with each other are the server application where the World Bank stores data and your Jupyter notebook.

If you wanted to pull data directly from the World Bank’s server, you’d have to know what database system the World Bank was using. You’d also need permission to log in directly to the server, which would be a security risk for the World Bank. And if the World Bank ever migrated its data to a new system, you would have to rewrite all of your code again. The API allows you to execute code on the World Bank server without getting direct access.

### Before there were APIs

Before there were APIs, there was web scraping. People would download HTML directly from a website and then parse the results programmatically. This practice is in a legal grey area. One reason that APIs became popular was so that companies could provide data to users and discourage web scraping.


All sorts of companies have public facing APIs including Facebook, Twitter, Google and Pinterest. You can pull data from these companies to create your own applications.

In this notebook, you’ll get practice using Python to pull data from the World Bank indicators API.

Here are links to information about the World Bank indicators and projects APIs if you want to learn more:
* [World Bank Indicators API](https://datahelpdesk.worldbank.org/knowledgebase/articles/889392-about-the-indicators-api-documentation)
* [World Bank Projects API](http://search.worldbank.org/api/v2/projects)

### Using APIs

In general, you access APIs via the web using a web address. Within the web address, you specify the data that you want. To know how to format the web address, you need to read an API's documentation. Some APIs also require that you send login credentials as part of your request. The World Bank APIs are public and do not require login credentials.

The Python requests library makes working with APIs relatively simple.

### Indicators by Country API

Run the code example below to request data from the World Bank Indicators API. According to the documentation, you format your request url like so:

`http://api.worldbank.org/v2/countries/` + list of country abbreviations separated by ; + `/indicators/` + indicator name + `?` + options

where options can include
* per_page - number of records to return per page
* page - which page to return, e.g. if there are 5000 records and 100 records per page, there are 50 pages to choose from
* date - filter by dates
* format - json or xml
 
 and a few other options that you can read about [here](https://datahelpdesk.worldbank.org/knowledgebase/articles/898581-api-basic-call-structure).
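The URL pattern described above can also be assembled programmatically. The following is a minimal sketch that uses only the standard library; `build_indicator_url` is a hypothetical helper name chosen for illustration, not part of the World Bank API or the requests library.

```python
from urllib.parse import urlencode

def build_indicator_url(countries, indicator, **options):
    """Assemble an Indicators API URL from its parts (hypothetical helper)."""
    base = "http://api.worldbank.org/v2/countries/"
    country_part = ";".join(countries)   # country codes are separated by ;
    query = urlencode(options)           # e.g. format=json&per_page=3
    return f"{base}{country_part}/indicators/{indicator}?{query}"

url = build_indicator_url(["br", "cn", "us", "de"], "SP.POP.TOTL",
                          format="json", per_page=3)
print(url)
```

Building the query string with `urlencode` avoids hand-placing the `?` and `&` separators, which is an easy spot for typos when more options are added.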

In [1]:
# You'll have to install the libraries we'll use.
# After running this line once, you can comment it out.
%conda install requests pandas feather-format boto3

# Now import the libraries
import requests # the library we'll use to call the API
import pandas as pd # good ole pandas
import json # Used to deal with the json objects that return from the API
import boto3 # You know this one!
import io  # Used for some input/output functions below

In [2]:
# Build a sample url. Notice it is a string formatted in a very specific way:
# While there are 244 rows of data, I'm just asking for 3 rows (per_page=3)
url = 'http://api.worldbank.org/v2/countries/br;cn;us;de/indicators/SP.POP.TOTL/?format=json&per_page=3'

# Call the url using the requests.get() function
r = requests.get(url)

# The whole response is a list with 2 elements:
#   The first element is a dict of metadata about the data
#   The second element is a list of dicts, each dict holding one row of data
#
# Look at the first element (index [0]), which is the metadata:
print('Metadata: \n', json.dumps(r.json()[0], indent=1, sort_keys=False))
#
# Now look at the data (index[1])
print('\nHere is the data:')
print(json.dumps(r.json()[1], indent=1, sort_keys=False))  # Only 3 rows of data

Metadata: 
 {
 "page": 1,
 "pages": 82,
 "per_page": 3,
 "total": 244,
 "sourceid": "2",
 "sourcename": "World Development Indicators",
 "lastupdated": "2021-12-16"
}

Here is the data:
[
 {
  "indicator": {
   "id": "SP.POP.TOTL",
   "value": "Population, total"
  },
  "country": {
   "id": "BR",
   "value": "Brazil"
  },
  "countryiso3code": "BRA",
  "date": "2020",
  "value": 212559409,
  "unit": "",
  "obs_status": "",
  "decimal": 0
 },
 {
  "indicator": {
   "id": "SP.POP.TOTL",
   "value": "Population, total"
  },
  "country": {
   "id": "BR",
   "value": "Brazil"
  },
  "countryiso3code": "BRA",
  "date": "2019",
  "value": 211049519,
  "unit": "",
  "obs_status": "",
  "decimal": 0
 },
 {
  "indicator": {
   "id": "SP.POP.TOTL",
   "value": "Population, total"
  },
  "country": {
   "id": "BR",
   "value": "Brazil"
  },
  "countryiso3code": "BRA",
  "date": "2018",
  "value": 209469320,
  "unit": "",
  "obs_status": "",
  "decimal": 0
 }
]
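The metadata reports 82 pages because 244 rows at 3 rows per page need ceil(244 / 3) = 82 requests. A minimal sketch of walking through every page is shown here. Both `fetch_all_rows` and `fake_fetch_page` are hypothetical names, and the fake function stands in for a real `requests.get()` call so that the sketch runs without network access.

```python
import math

def fetch_all_rows(fetch_page, per_page):
    """Collect rows from every page.

    fetch_page(page, per_page) must return a (metadata, rows) pair,
    mirroring the two-element response shown above.
    """
    meta, rows = fetch_page(1, per_page)
    all_rows = list(rows)
    # The first response tells us how many pages exist in total
    for page in range(2, meta["pages"] + 1):
        _, rows = fetch_page(page, per_page)
        all_rows.extend(rows)
    return all_rows

def fake_fetch_page(page, per_page, total=244):
    """Offline stand-in for the API: serves `total` dummy rows in pages."""
    pages = math.ceil(total / per_page)
    start = (page - 1) * per_page
    rows = [{"row": i} for i in range(start, min(start + per_page, total))]
    meta = {"page": page, "pages": pages, "per_page": per_page, "total": total}
    return meta, rows

rows = fetch_all_rows(fake_fetch_page, per_page=3)
print(len(rows))  # 244
```

In a real notebook, `fetch_page` could wrap `requests.get()` with the `page` and `per_page` options added to the URL and return the two elements of `r.json()`.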


#### So what indicators are available?
Go to the search box on this site and search for: SP.POP.TOTL
* [Website for all World Bank indicators](https://data.worldbank.org)

To find an indicator code, first search for the indicator at https://data.worldbank.org and click on the indicator name. The indicator code is in the URL. For example, the indicator code for total population is SP.POP.TOTL, which you can see in the link:

https://data.worldbank.org/indicator/SP.POP.TOTL

#### Load all rows into a pandas DataFrame

In [3]:
# Let's build the same url by components:

base_url = 'http://api.worldbank.org/v2/countries/'
ctry = 'br;cn;us;de/'
ind = 'SP.POP.TOTL/?'
form = 'format=json'
# We know there are 244 rows, so let's get all of them in 1 page
num  = '&per_page=500'

# Build the final string
url = base_url + ctry + 'indicators/' + ind + form + num
print(url)

# Call the API again
r = requests.get(url)

# Now simply load the data into a dataframe
df = pd.DataFrame(r.json()[1])
df.head(4)

http://api.worldbank.org/v2/countries/br;cn;us;de/indicators/SP.POP.TOTL/?format=json&per_page=500


| | indicator | country | countryiso3code | date | value | unit | obs_status | decimal |
|---|---|---|---|---|---|---|---|---|
| 0 | {'id': 'SP.POP.TOTL', 'value': 'Population, to... | {'id': 'BR', 'value': 'Brazil'} | BRA | 2020 | 212559409 | | | 0 |
| 1 | {'id': 'SP.POP.TOTL', 'value': 'Population, to... | {'id': 'BR', 'value': 'Brazil'} | BRA | 2019 | 211049519 | | | 0 |
| 2 | {'id': 'SP.POP.TOTL', 'value': 'Population, to... | {'id': 'BR', 'value': 'Brazil'} | BRA | 2018 | 209469320 | | | 0 |
| 3 | {'id': 'SP.POP.TOTL', 'value': 'Population, to... | {'id': 'BR', 'value': 'Brazil'} | BRA | 2017 | 207833825 | | | 0 |
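The `indicator` and `country` columns in this DataFrame still hold nested dictionaries, which are awkward to filter or group on. One possible way to flatten each row before building the DataFrame is sketched here. `flatten_record` is a hypothetical helper, and the two sample rows are copied from the response printed earlier.

```python
# Two rows copied from the sample response shown earlier in this notebook
records = [
    {"indicator": {"id": "SP.POP.TOTL", "value": "Population, total"},
     "country": {"id": "BR", "value": "Brazil"},
     "countryiso3code": "BRA", "date": "2020", "value": 212559409,
     "unit": "", "obs_status": "", "decimal": 0},
    {"indicator": {"id": "SP.POP.TOTL", "value": "Population, total"},
     "country": {"id": "BR", "value": "Brazil"},
     "countryiso3code": "BRA", "date": "2019", "value": 211049519,
     "unit": "", "obs_status": "", "decimal": 0},
]

def flatten_record(record):
    """Turn one level of nested dicts into prefixed top-level keys."""
    flat = {}
    for key, val in record.items():
        if isinstance(val, dict):
            # e.g. country -> country_id, country_value
            for sub_key, sub_val in val.items():
                flat[f"{key}_{sub_key}"] = sub_val
        else:
            flat[key] = val
    return flat

flat_rows = [flatten_record(rec) for rec in records]
print(flat_rows[0]["country_value"], flat_rows[0]["date"], flat_rows[0]["value"])
```

Passing `flat_rows` to `pd.DataFrame()` then gives plain columns such as `country_id` and `country_value`. pandas also offers `pd.json_normalize()`, which performs a similar flattening directly on the list of records.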


#### Country code list:
* [2-character iso country codes](https://www.nationsonline.org/oneworld/country_code_list.htm)

### Get Indicator data for All Countries

Before, we requested an indicator for a country or a list of countries. Here, we are going to get all the data for an indicator across all countries.

* [Website to search for indicator by name](https://data.worldbank.org/indicator?tab=all)

In [4]:
# Indicator: Life Expectancy at Birth: SP.DYN.LE00.IN
# Let's build the url by components

base_url = 'http://api.worldbank.org/v2/country/all/indicator/'
ind = 'SP.DYN.LE00.IN/?'
form = 'format=json'
# We don't know the number of rows, let's get a few and see how many rows exist
num  = '&per_page=3'

# Build the final string
url = base_url + ind + form + num
print(url)

# Call the API
r = requests.get(url)

# From the first list (metadata) identify the total number of rows
total_rows = r.json()[0]['total']
total_rows

http://api.worldbank.org/v2/country/all/indicator/SP.DYN.LE00.IN/?format=json&per_page=3


16226
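The cells above index `r.json()[0]` and `r.json()[1]` directly, which fails with a confusing error if the request goes wrong. A minimal sketch of a guard is given here, assuming that a successful response always has the two-element shape shown in this notebook. `split_payload` is a hypothetical helper, and the error payload in the example is made up for illustration rather than taken from the API documentation.

```python
def split_payload(payload):
    """Return (metadata, rows), or raise if the payload has another shape."""
    if (not isinstance(payload, list) or len(payload) != 2
            or not isinstance(payload[0], dict) or payload[1] is None):
        raise ValueError(f"Unexpected API payload: {payload!r}")
    metadata, rows = payload
    return metadata, rows

# Shape of a successful response, abbreviated from the output above
good = [{"page": 1, "pages": 82, "per_page": 3, "total": 244},
        [{"countryiso3code": "BRA", "date": "2020", "value": 212559409}]]
metadata, rows = split_payload(good)
print(metadata["total"], len(rows))

# Made-up example of a payload that does not match the expected shape
bad = [{"message": "something went wrong"}]
try:
    split_payload(bad)
except ValueError as err:
    print("Rejected:", err)
```

In the notebook this could be called as `split_payload(r.json())`, ideally after `r.raise_for_status()` so that HTTP-level failures are reported first.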

In [5]:
# Update the query string
num  = '&per_page=' + str(total_rows)  # It has to be a string

# Build the final string
url = base_url + ind + form + num
print(url)

# Call the API
r = requests.get(url)

# Load the data into a dataframe
df2 = pd.DataFrame(r.json()[1])
print('The number of rows in the new dataframe is:',len(df2))
df2.head(2)

http://api.worldbank.org/v2/country/all/indicator/SP.DYN.LE00.IN/?format=json&per_page=16226
The number of rows in the new dataframe is: 16226


| | indicator | country | countryiso3code | date | value | unit | obs_status | decimal |
|---|---|---|---|---|---|---|---|---|
| 0 | {'id': 'SP.DYN.LE00.IN', 'value': 'Life expect... | {'id': 'ZH', 'value': 'Africa Eastern and Sout... | AFE | 2020 | | | | 0 |
| 1 | {'id': 'SP.DYN.LE00.IN', 'value': 'Life expect... | {'id': 'ZH', 'value': 'Africa Eastern and Sout... | AFE | 2019 | 64.005197 | | | 0 |


## Saving to a Feather formatted file
- Feather was created primarily as a data format for exchanging data frames between Python and R.
- Feather is a binary data format.
- Reading and writing Feather files is generally faster and uses less memory than text formats such as CSV. However, because the format is still evolving, it is better suited to quick loading and transformation steps than to long-term storage.
- Calling the to_feather() method on a DataFrame object writes the DataFrame to a file on disk in Feather format.
- Calling the pandas read_feather() function reads the contents of that file back into a DataFrame, which can then be printed to the console.

Code to write and read data frames in Feather format in the R language:

```
library(feather)
path <- "df.feather"
write_feather(df, path)
df <- read_feather(path)
```



In [6]:
# Save the df2 DataFrame to a Feather formatted file in the current directory in SageMaker.
df2.to_feather('df2.feather')

In [7]:
# Read the Feather file into a new DataFrame. 
df3 = pd.read_feather('df2.feather')
df3.head(3)

| | indicator | country | countryiso3code | date | value | unit | obs_status | decimal |
|---|---|---|---|---|---|---|---|---|
| 0 | {'id': 'SP.DYN.LE00.IN', 'value': 'Life expect... | {'id': 'ZH', 'value': 'Africa Eastern and Sout... | AFE | 2020 | | | | 0 |
| 1 | {'id': 'SP.DYN.LE00.IN', 'value': 'Life expect... | {'id': 'ZH', 'value': 'Africa Eastern and Sout... | AFE | 2019 | 64.005197 | | | 0 |
| 2 | {'id': 'SP.DYN.LE00.IN', 'value': 'Life expect... | {'id': 'ZH', 'value': 'Africa Eastern and Sout... | AFE | 2018 | 63.648988 | | | 0 |


In [8]:
# df2 & df3 should be the same. Make sure they are the same.
if df2.equals(df3):
    print('Same')
else:
    print('Some difference')

Same


## Activity:
Using boto3, save your df2 to a Feather file on AWS S3. The bucket is `gse580`, and the key is the rest of the path below:

```
/gse580/your_username/data/df2.feather
```
    

In [9]:
# Don't forget, you have to perform 'aws configure' at a terminal CLI in SageMaker.
#
# (studiolab) studio-lab-user@default:~$ aws configure
# AWS Access Key ID [****************I4JA]: 
# AWS Secret Access Key [****************7YuJ]: 
# Default region name [us-west-2]: 
# Default output format [json]:

In [10]:
# Recall code from last class meeting to get your username:
session = boto3.Session()
sts = session.client('sts')
response = sts.get_caller_identity()
my_username = response['Arn'].split('/')[1]
print(my_username)

kcolvin
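The split above works because the ARN of an IAM user has the form `arn:aws:iam::account-id:user/user-name`, so the username is the part after the slash. A small offline sketch of that parsing step follows. `username_from_arn` is a hypothetical helper, and the account ID in the sample ARN is a placeholder, not a real account.

```python
def username_from_arn(arn):
    """Return the last path segment of an IAM user ARN (hypothetical helper)."""
    # Taking the last segment also handles ARNs that include a path,
    # e.g. arn:aws:iam::123456789012:user/some/path/kcolvin
    return arn.split("/")[-1]

sample_arn = "arn:aws:iam::123456789012:user/kcolvin"  # placeholder account ID
print(username_from_arn(sample_arn))  # kcolvin
```

For an ARN without a path, this returns the same value as `split('/')[1]` in the cell above.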


In [11]:
# Code to save df2 to feather file on AWS S3
s3c = session.client('s3')
bucket = 'gse580'
key = my_username+'/data/df2.feather'

with io.BytesIO() as ff:
    # Use the pandas to_feather() function
    df2.to_feather(ff)
    #
    # Here is the put_object function
    response = s3c.put_object(Bucket=bucket, Key=key, Body=ff.getvalue())
    #
    status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
    #
    if status == 200:
        print(f"Successful S3 put_object response. Status - {status}")
    else:
        print(f"Unsuccessful S3 put_object response. Status - {status}")

Successful S3 put_object response. Status - 200


In [12]:
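# Verify it exists. Filtering by Prefix keeps the check working even when
# the bucket holds more objects than a single listing call returns.
response = s3c.list_objects_v2(Bucket=bucket, Prefix=key)
for obj in response.get('Contents', []):
    # Report the object whose key matches exactly
    if obj['Key'] == key:
        print('It does exist:')
        print(obj['Key'])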

It does exist:
kcolvin/data/df2.feather


In [13]:
# Read the Feather File from S3 directly into a pandas DF
# bucket and key are defined above
feather_obj = s3c.get_object(Bucket=bucket, Key=key)
new_df = pd.read_feather(io.BytesIO(feather_obj['Body'].read()))
new_df.head(3)

| | indicator | country | countryiso3code | date | value | unit | obs_status | decimal |
|---|---|---|---|---|---|---|---|---|
| 0 | {'id': 'SP.DYN.LE00.IN', 'value': 'Life expect... | {'id': 'ZH', 'value': 'Africa Eastern and Sout... | AFE | 2020 | | | | 0 |
| 1 | {'id': 'SP.DYN.LE00.IN', 'value': 'Life expect... | {'id': 'ZH', 'value': 'Africa Eastern and Sout... | AFE | 2019 | 64.005197 | | | 0 |
| 2 | {'id': 'SP.DYN.LE00.IN', 'value': 'Life expect... | {'id': 'ZH', 'value': 'Africa Eastern and Sout... | AFE | 2018 | 63.648988 | | | 0 |
