# Get Time Series data for WDI via API

Documentation: 

- https://datahelpdesk.worldbank.org/knowledgebase/articles/898581-api-basic-call-structures

Basic URL:

http://api.worldbank.org/v2/sdmx/rest/data/WDI/../?startperiod=&endPeriod=



In [1]:
import requests
import time
import pandas as pd

## get inputs

Use all WDI indicators, gathered from a previous query to the World Bank (WB) API.

Use the ISO alpha-3 codes from the CShapes file, which maps CoW, UCDP/PRIO, and ISO codes for countries. Exclude any codes that are not present in the WB data, as determined from a previous query to the WB API.

In [2]:
wdi_df = pd.read_csv('../Data/WorldBank/Raw_API/indicators-wdi-wtopics.csv')
i_list = list(wdi_df['id'].unique())

countries_df = pd.read_csv("../Data/countrycode/country_conversion_table.csv", usecols=['wb']).drop_duplicates().dropna()
# only the 'wb' column was loaded above, so read the codes from that column
c_list = list(countries_df['wb'])

wb_countries = pd.read_csv('../Data/WorldBank/Raw_API/countries_list.csv')
wb_c_list = list(wb_countries['id'].unique())

Not every row in the WB country list is a country; some are aggregates. These are the observations that do not appear in the countrycode base table:

In [3]:
wb_not_in_cc = list(set(wb_c_list) - set(c_list))
wb_countries[wb_countries['id'].isin(wb_not_in_cc)]

Unnamed: 0,id,iso2code,name,capitalcity,latitude,longitude,region_name,adminregion_name,incomelevel_value,lendingtype_value
0,ABW,AW,Aruba,Oranjestad,12.516700,-70.016700,Latin America & Caribbean,,High income,Not classified
2,AFR,A9,Africa,,,,Aggregates,,Aggregates,Aggregates
6,ANR,L5,Andean Region,,,,Aggregates,,Aggregates,Aggregates
7,ARB,1A,Arab World,,,,Aggregates,,Aggregates,Aggregates
11,ASM,AS,American Samoa,Pago Pago,-14.284600,-170.691000,East Asia & Pacific,East Asia & Pacific (excluding high income),Upper middle income,Not classified
...,...,...,...,...,...,...,...,...,...,...
286,UMC,XT,Upper middle income,,,,Aggregates,,Aggregates,Aggregates
292,VGB,VG,British Virgin Islands,Road Town,18.431389,-64.623056,Latin America & Caribbean,,High income,Not classified
293,VIR,VI,Virgin Islands (U.S.),Charlotte Amalie,18.335800,-64.896300,Latin America & Caribbean,,High income,Not classified
296,WLD,1W,World,,,,Aggregates,,Aggregates,Aggregates
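
The output above mixes two kinds of rows: regional or income aggregates, which carry `Aggregates` in `region_name`, and entries such as Aruba that have a real region but are absent from the countrycode table. A minimal sketch of separating the two, using a few rows copied from the output above; the variable names are hypothetical and plain dictionaries stand in for the dataframe:

```python
# Rows copied from the output above (subset of columns only).
sample_rows = [
    {'id': 'ABW', 'name': 'Aruba', 'region_name': 'Latin America & Caribbean'},
    {'id': 'AFR', 'name': 'Africa', 'region_name': 'Aggregates'},
    {'id': 'ARB', 'name': 'Arab World', 'region_name': 'Aggregates'},
    {'id': 'ASM', 'name': 'American Samoa', 'region_name': 'East Asia & Pacific'},
    {'id': 'WLD', 'name': 'World', 'region_name': 'Aggregates'},
]

# Aggregates are flagged in the region_name column.
aggregates = [r['id'] for r in sample_rows if r['region_name'] == 'Aggregates']
# Everything else has a real region but is missing from the countrycode table.
non_aggregates = [r['id'] for r in sample_rows if r['region_name'] != 'Aggregates']

print(aggregates)      # ['AFR', 'ARB', 'WLD']
print(non_aggregates)  # ['ABW', 'ASM']
```

On the full dataframe, the same split could be expressed as a boolean mask on `wb_countries['region_name']`.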


However, every WB code in the countrycode table also appears in the WB country list, so that list is safe to use. It contains 195 countries.

In [4]:
set(c_list) - set(wb_c_list), len(c_list)

(set(), 195)

In [5]:
len(i_list)

1387

### Top 25 indicators

The list of the top 25 most popular WDI indicators is from: https://datatopics.worldbank.org/world-development-indicators/stories/world-development-indicators-the-story.html

In [11]:
top25_indicators = ['NY.GDP.MKTP.CD', 'SP.POP.TOTL', 'NY.GDP.MKTP.KD.ZG', 'NY.GDP.PCAP.CD', 'SI.POV.GINI', 
                    'SP.POP.TOTL.FE.IN', 'SP.DYN.TFRT.IN', 'NY.GDP.PCAP.PP.CD', 'SP.DYN.LE00.IN', 'MS.MIL.XPND.GD.ZS', 
                    'FP.CPI.TOTL.ZG', 'SP.URB.TOTL.IN.ZS', 'EN.POP.DNST', 'EN.ATM.CO2E.PC', 'SH.XPD.CHEX.GD.ZS', 
                    'SP.POP.GROW', 'ST.INT.ARVL', 'NY.GDP.PCAP.KD.ZG', 'BX.KLT.DINV.CD.WD', 'SE.XPD.TOTL.GB.ZS', 
                    'SE.ADT.LITR.ZS', 'EG.ELC.ACCS.ZS', 'SL.UEM.TOTL.ZS', 'NE.EXP.GNFS.ZS', 'SP.DYN.IMRT.IN']

## function to page through results

The function needs to account for errors: the call may be unsuccessful, the WB may return an error message instead of data, etc.
If successful, it returns the data. If the call keeps failing, it returns the last response (to investigate why).

In [6]:
def page_through(baseurl):
    # count consecutive failed attempts
    tries = 0
    # initiate page count and dummy number of pages
    pagecount = 1
    pages = 1
    total_obs = None
    # to store results from request
    results = []
    # set initial url for first page
    url = baseurl
    while pagecount <= pages:
        # in case something is fundamentally wrong with query,
        # return the last response so it can be inspected
        if tries > 5:
            return data_call
        # attempt data request
        data_call = requests.get(url)
        # parse the body once; a valid response is a [header, data] list
        payload = data_call.json() if data_call.status_code == 200 else None
        # compensate for error in call (bad status code or WB error message)
        if payload is None or len(payload) < 2:
            tries += 1
            time.sleep(5)
            continue
        # reset error tracking vars
        tries = 0
        header, response = payload[0], payload[1]
        # track pages and number of observations so the loop knows when to stop
        pages = header['pages']
        total_obs = header['total']
        # add data to results (up to 10,000 per page)
        results.extend(response)

        # increment page number to get next page
        pagecount += 1
        url = baseurl + '&page=' + str(pagecount)
        time.sleep(5)

    if len(results) == total_obs:
        return results
    else:
        # the caller treats any non-list return value as a failed URL
        print("Something went wrong")
        return None
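
The paging logic can be exercised without touching the network by passing in the function that fetches a page. The following is a minimal sketch, assuming the `[header, rows]` response shape handled above, with `pages` and `total` in the header. `collect_pages` and `fake_fetch` are hypothetical names for illustration; they are not part of this notebook or of the World Bank API, and the sketch leaves out the retry and sleep handling.

```python
def collect_pages(fetch, max_pages=1000):
    """Collect rows from a paged source.

    `fetch(page)` is a hypothetical callable returning a [header, rows] pair,
    where the header carries 'pages' and 'total'.
    """
    results = []
    page, pages, total = 1, 1, None
    while page <= pages and page <= max_pages:
        header, rows = fetch(page)
        # the header tells us how many pages and rows to expect overall
        pages, total = header['pages'], header['total']
        results.extend(rows or [])
        page += 1
    if len(results) != total:
        raise ValueError(f"expected {total} rows, got {len(results)}")
    return results


# Fake data source: 25 rows served 10 per page, i.e. 3 pages.
FAKE_ROWS = list(range(25))
PER_PAGE = 10

def fake_fetch(page):
    start = (page - 1) * PER_PAGE
    header = {'pages': 3, 'total': len(FAKE_ROWS)}
    return [header, FAKE_ROWS[start:start + PER_PAGE]]

rows = collect_pages(fake_fetch)
print(len(rows))  # 25
```

Separating the fetch from the loop is only one possible design; it makes the stop condition (`page <= pages`) and the final row-count check easy to verify in isolation.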

## function to create url for queries

Each URL includes all countries and up to 20 indicators.

In [7]:
def cycle_through(countries, indicators):

    country_list = ";".join(countries)
    url_list = []

    # step through the indicators in batches of 20; slicing past the end
    # of the list simply returns the remaining items, and no empty batch
    # is produced when the total is an exact multiple of 20
    for start in range(0, len(indicators), 20):
        indicator_list = ";".join(indicators[start:start+20])

        url = ("http://api.worldbank.org/v2/country/" + country_list + "/indicator/" + indicator_list
               + "?source=2" + "&format=json&per_page=10000")
        url_list.append(url)

    return url_list
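
As a quick check on how many URLs to expect, the batch arithmetic can be sketched on its own. `batch_sizes` is a hypothetical helper, not part of this notebook; it only reports how many indicators each batch of at most 20 would hold. A total that is an exact multiple of 20 should not produce an empty trailing batch.

```python
def batch_sizes(total, size=20):
    """Sizes of the batches obtained by stepping through `total` items."""
    return [min(size, total - start) for start in range(0, total, size)]

print(batch_sizes(25))         # [20, 5]
print(batch_sizes(40))         # [20, 20]
print(len(batch_sizes(1387)))  # 70
print(batch_sizes(1387)[-1])   # 7
```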

## create urls

In [8]:
url_list_allindicators = cycle_through(countries=c_list, indicators=i_list)
len(url_list_allindicators)

70

In [12]:
url_list_top25 = cycle_through(countries=c_list, indicators=top25_indicators)
len(url_list_top25)

2

## request data and transform into a dataframe

The code below uses the 'url_list_top25' list. To get the full indicator list, use the 'url_list_allindicators' variable in the for loop instead.

In [13]:
raw_data = []
skipped_urls = []
for url in url_list_top25:
    chunk = page_through(url)
    if not isinstance(chunk, list):
        err_result = {url: chunk}
        skipped_urls.append(err_result)
    else:
        raw_data.extend(chunk)

In [14]:
len(raw_data)

291000

In [15]:
len(skipped_urls)

0

In [16]:
raw_data[0]

{'indicator': {'id': 'NY.GDP.MKTP.CD', 'value': 'GDP (current US$)'},
 'country': {'id': 'AF', 'value': 'Afghanistan'},
 'countryiso3code': 'AFG',
 'date': '2019',
 'value': None,
 'scale': '',
 'unit': '',
 'obs_status': '',
 'decimal': 0}

In [17]:
time_series_results = []
for r in raw_data:
    row = {'country': r['countryiso3code'], 'indicator': r['indicator']['id'], 'year': r['date'], 'value': r['value'], 
           'unit': r['unit'], 'obs_status': r['obs_status'], 'decimal': r['decimal'],
           # 'scale' may be absent from a record, so default to an empty string
           'scale': r.get('scale', '')}
    
    time_series_results.append(row)

In [18]:
time_series_df = pd.DataFrame(time_series_results)
time_series_df

Unnamed: 0,country,indicator,year,value,unit,obs_status,decimal,scale
0,AFG,NY.GDP.MKTP.CD,2019,,,,0,
1,AFG,NY.GDP.MKTP.CD,2018,1.936297e+10,,,0,
2,AFG,NY.GDP.MKTP.CD,2017,2.019176e+10,,,0,
3,AFG,NY.GDP.MKTP.CD,2016,1.936264e+10,,,0,
4,AFG,NY.GDP.MKTP.CD,2015,1.990711e+10,,,0,
...,...,...,...,...,...,...,...,...
290995,ZWE,SP.DYN.IMRT.IN,1964,8.320000e+01,,,0,
290996,ZWE,SP.DYN.IMRT.IN,1963,8.570000e+01,,,0,
290997,ZWE,SP.DYN.IMRT.IN,1962,8.810000e+01,,,0,
290998,ZWE,SP.DYN.IMRT.IN,1961,9.050000e+01,,,0,


## Inspect and clean up the dataframe

In [19]:
time_series_df.columns

Index(['country', 'indicator', 'year', 'value', 'unit', 'obs_status',
       'decimal', 'scale'],
      dtype='object')

In [20]:
time_series_df['scale'].unique()

array([''], dtype=object)

In [21]:
time_series_df['unit'].unique()

array([''], dtype=object)

In [22]:
time_series_df['obs_status'].unique()

array([''], dtype=object)

In [23]:
time_series_df['decimal'].unique()

array([0, 1, 2])

In [24]:
time_series_df['year'].unique()

array(['2019', '2018', '2017', '2016', '2015', '2014', '2013', '2012',
       '2011', '2010', '2009', '2008', '2007', '2006', '2005', '2004',
       '2003', '2002', '2001', '2000', '1999', '1998', '1997', '1996',
       '1995', '1994', '1993', '1992', '1991', '1990', '1989', '1988',
       '1987', '1986', '1985', '1984', '1983', '1982', '1981', '1980',
       '1979', '1978', '1977', '1976', '1975', '1974', '1973', '1972',
       '1971', '1970', '1969', '1968', '1967', '1966', '1965', '1964',
       '1963', '1962', '1961', '1960'], dtype=object)

In [25]:
len(time_series_df['year'].unique())

60

In [26]:
time_series_df['country'].unique()

array(['AFG', 'AGO', 'ALB', 'AND', 'ARE', 'ARG', 'ARM', 'ATG', 'AUS',
       'AUT', 'AZE', 'BDI', 'BEL', 'BEN', 'BFA', 'BGD', 'BGR', 'BHR',
       'BHS', 'BIH', 'BLR', 'BLZ', 'BOL', 'BRA', 'BRB', 'BRN', 'BTN',
       'BWA', 'CAF', 'CAN', 'CHE', 'CHL', 'CHN', 'CIV', 'CMR', 'COD',
       'COG', 'COL', 'COM', 'CPV', 'CRI', 'CUB', 'CYP', 'CZE', 'DEU',
       'DJI', 'DMA', 'DNK', 'DOM', 'DZA', 'ECU', 'EGY', 'ERI', 'ESP',
       'EST', 'ETH', 'FIN', 'FJI', 'FRA', 'FSM', 'GAB', 'GBR', 'GEO',
       'GHA', 'GIN', 'GMB', 'GNB', 'GNQ', 'GRC', 'GRD', 'GTM', 'GUY',
       'HND', 'HRV', 'HTI', 'HUN', 'IDN', 'IND', 'IRL', 'IRN', 'IRQ',
       'ISL', 'ISR', 'ITA', 'JAM', 'JOR', 'JPN', 'KAZ', 'KEN', 'KGZ',
       'KHM', 'KIR', 'KNA', 'KOR', 'KWT', 'LAO', 'LBN', 'LBR', 'LBY',
       'LCA', 'LIE', 'LKA', 'LSO', 'LTU', 'LUX', 'LVA', 'MAR', 'MCO',
       'MDA', 'MDG', 'MDV', 'MEX', 'MHL', 'MKD', 'MLI', 'MLT', 'MMR',
       'MNE', 'MNG', 'MOZ', 'MRT', 'MUS', 'MWI', 'MYS', 'NAM', 'NER',
       'NGA', 'NIC',

In [27]:
len(time_series_df['country'].unique())

194

In [28]:
len(time_series_df['indicator'].unique())

25
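
The counts above are consistent with a full country × indicator × year grid. This suggests, though it does not prove, that the API returns a row for every combination and marks missing observations with a null value, as in the first record shown earlier. A worked check of the arithmetic, using only the counts reported above:

```python
# Counts taken from the outputs above.
n_countries, n_indicators, n_years = 194, 25, 60

expected_rows = n_countries * n_indicators * n_years
print(expected_rows)  # 291000, which matches len(raw_data)
```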

In [30]:
set(c_list) - set(time_series_df['country'].unique())

{'TWN'}

In [31]:
len(set(c_list))

195

NOTE: No results were returned for TWN, even though it appears in the WB country list.

See: https://datahelpdesk.worldbank.org/knowledgebase/articles/114933-where-are-your-data-on-taiwan
and: https://datahelpdesk.worldbank.org/knowledgebase/articles/378834-how-does-the-world-bank-classify-countries

In [32]:
time_series_df = time_series_df.drop(columns = ['scale', 'unit', 'obs_status'])
time_series_df

Unnamed: 0,country,indicator,year,value,decimal
0,AFG,NY.GDP.MKTP.CD,2019,,0
1,AFG,NY.GDP.MKTP.CD,2018,1.936297e+10,0
2,AFG,NY.GDP.MKTP.CD,2017,2.019176e+10,0
3,AFG,NY.GDP.MKTP.CD,2016,1.936264e+10,0
4,AFG,NY.GDP.MKTP.CD,2015,1.990711e+10,0
...,...,...,...,...,...
290995,ZWE,SP.DYN.IMRT.IN,1964,8.320000e+01,0
290996,ZWE,SP.DYN.IMRT.IN,1963,8.570000e+01,0
290997,ZWE,SP.DYN.IMRT.IN,1962,8.810000e+01,0
290998,ZWE,SP.DYN.IMRT.IN,1961,9.050000e+01,0


In [33]:
time_series_df.duplicated(subset=['country', 'indicator', 'year']).sum()

0

## Export to csv

In [34]:
time_series_df.to_csv("../Data/FINAL/wdi_top25.csv", index=False)