## **Data Gathering for INFO 4300 Project from World Bank Data Sets**

The data sets are accessed through `wbgapi` 1.0.7, which provides modern, Pythonic access to the World Bank's data API. Install it with `pip install wbgapi`.  
[https://pypi.org/project/wbgapi/](https://pypi.org/project/wbgapi/)
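
Because the notebook names a specific `wbgapi` release, it can help to confirm which version is installed before running the cells. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of `wbgapi`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("wbgapi")
if found is None:
    print("wbgapi is not installed; run: pip install wbgapi")
elif found != "1.0.7":
    # A different release may still work; this only flags the difference.
    print(f"wbgapi {found} is installed; this notebook was written against 1.0.7")
```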

In [1]:
# Import the World Bank API client and the analysis libraries
import wbgapi as wb
import pandas as pd 
import numpy as np 

In [2]:
# Begin EDA by exploring topics available through API
wb.topic.info()

id,value
1,Agriculture & Rural Development
2,Aid Effectiveness
3,Economy & Growth
4,Education
5,Energy & Mining
6,Environment
7,Financial Sector
8,Health
9,Infrastructure
10,Social Protection & Labor


In [24]:
# Individual data series can be reviewed by topic (topic 11 is Poverty)
wb.series.info(topic = 11)

id,value
SI.SPR.PC40.ZG,"Annualized average growth rate in per capita real survey mean consumption or income, bottom 40% of population (%)"
SI.SPR.PCAP.ZG,"Annualized average growth rate in per capita real survey mean consumption or income, total population (%)"
SI.POV.GINI,Gini index (World Bank estimate)
SI.DST.04TH.20,Income share held by fourth 20%
SI.DST.10TH.10,Income share held by highest 10%
SI.DST.05TH.20,Income share held by highest 20%
SI.DST.FRST.10,Income share held by lowest 10%
SI.DST.FRST.20,Income share held by lowest 20%
SI.DST.02ND.20,Income share held by second 20%
SI.DST.03RD.20,Income share held by third 20%


In [29]:
# Requests 30 selected series from wbgapi for 1970-2020; the returned frame has 27 columns. Takes nearly 30 minutes to run.
raw_data = wb.data.DataFrame(['EG.ELC.ACCS.ZS', 'EG.ELC.COAL.ZS', 'EG.ELC.FOSL.ZS', 'EG.ELC.HYRO.ZS', 'EG.ELC.LOSS.ZS', 'EG.ELC.NGAS.ZS', 
                                'EG.ELC.NUCL.ZS', 'EG.ELC.PETR.ZS', 'EG.ELC.PROD.KH', 'EG.ELC.RNWX.ZS', 'EG.USE.ELEC.KH', 'NE.EXP.GNFS.ZS', 
                                'NE.IMP.GNFS.ZS', 'NV.AGR.TOTL.ZS', 'NV.IND.MANF.ZS', 'NV.IND.TOTL.ZS', 'NV.SRV.TOTL.ZS', 'SL.TLF.TOTL.FE.ZS', 
                                'SL.TLF.TOTL.IN', 'SP.POP.TOTL', 'NY.GDP.MKTP.CD', 'EN.ATM.GHGT.KT.CE', 'EN.ATM.METH.KT.CE', 'EN.CO2.ETOT.MT',
                                'SP.DYN.LE00.FE.IN', 'SP.URB.TOTL', 'SP.DYN.TFRT.IN', 'IC.LGL.CRED.XQ', 'SI.POV.MDIM', 'SI.POV.GINI'], 
                                time = range(1970, 2021), skipBlanks = True, columns = 'series')
print(raw_data.shape)
raw_data.head()


(13495, 27)


economy,time,EG.ELC.ACCS.ZS,EG.ELC.COAL.ZS,EG.ELC.FOSL.ZS,EG.ELC.HYRO.ZS,EG.ELC.LOSS.ZS,EG.ELC.NGAS.ZS,EG.ELC.NUCL.ZS,EG.ELC.PETR.ZS,EG.ELC.RNWX.ZS,EN.ATM.GHGT.KT.CE,...,NV.SRV.TOTL.ZS,NY.GDP.MKTP.CD,SI.POV.GINI,SI.POV.MDIM,SL.TLF.TOTL.FE.ZS,SL.TLF.TOTL.IN,SP.DYN.LE00.FE.IN,SP.DYN.TFRT.IN,SP.POP.TOTL,SP.URB.TOTL
ABW,YR1970,,,,,,,,,,42.306298,...,,,,,,,71.306,2.908,59070.0,29904.0
ABW,YR1971,,,,,,,,,,42.786948,...,,,,,,,71.815,2.788,59442.0,30083.0
ABW,YR1972,,,,,,,,,,43.286613,...,,,,,,,72.313,2.691,59849.0,30279.0
ABW,YR1973,,,,,,,,,,43.72459,...,,,,,,,72.779,2.613,60236.0,30466.0
ABW,YR1974,,,,,,,,,,44.130957,...,,,,,,,73.204,2.552,60527.0,30604.0
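
The request in this cell lists 30 series IDs, while the resulting frame has 27 columns. A minimal sketch of how the dropped IDs could be identified follows. `missing_series` is a hypothetical helper, not part of `wbgapi`, and the two short lists are illustrative samples rather than the full request.

```python
def missing_series(requested, returned):
    """Return the requested series IDs that are absent from the returned columns.

    The order of `requested` is preserved so the result is easy to compare
    against the original request list.
    """
    returned_set = set(returned)
    return [series_id for series_id in requested if series_id not in returned_set]


# Illustrative sample: three requested IDs, two of which came back as columns.
sample_requested = ["EG.ELC.PETR.ZS", "EG.ELC.PROD.KH", "EG.ELC.RNWX.ZS"]
sample_returned = ["EG.ELC.PETR.ZS", "EG.ELC.RNWX.ZS"]

dropped = missing_series(sample_requested, sample_returned)
print(dropped)
```

In the notebook, the same call could take the full list passed to `wb.data.DataFrame` together with `raw_data.columns`.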


In [30]:
# Save the data to CSV to avoid repeating the roughly 30-minute API pull. The file is used in a separate analysis notebook for the project.
raw_data.to_csv('wb_raw_data.csv')
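
The saved file can be read back in the analysis notebook without calling the API again. The sketch below assumes the CSV keeps `economy` and `time` as its first two columns, as in the preview of `raw_data`, and that time labels have the form `YR1970`. `year_from_label` and `load_raw_data` are hypothetical helper names.

```python
def year_from_label(label):
    """Convert a time label such as 'YR1970' to the integer 1970."""
    text = str(label)
    if text.startswith("YR"):
        text = text[2:]
    return int(text)


def load_raw_data(path="wb_raw_data.csv"):
    """Reload the saved CSV with (economy, time) as a two-level index."""
    # pandas is imported here so year_from_label stays usable on its own.
    import pandas as pd

    df = pd.read_csv(path, index_col=[0, 1])
    # Replace the 'YRxxxx' labels in the second index level with integer years.
    return df.rename(index=year_from_label, level=1)


print(year_from_label("YR1970"))
```

Integer years make it easier to filter by range or to plot against time than the original string labels.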