## Intro Matplotlib
Author: Emmanuel Rodriguez

Date: 8 May 2022

Location: Fort Hancock, TX

emmanueljrodriguez.com

## World Development Indicators (WDI)

World Bank's compilation of global development data.

## Step 1: Initial exploration of the dataset

In [3]:
# Import libraries
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt

In [4]:
# Download dataset from https://datatopics.worldbank.org/world-development-indicators/?msclkid=905295b0cef211eca247786f37737c73

data = pd.read_csv('./WDI_csv/WDIData.csv') # Read the .csv file into a DataFrame (a two-dimensional table with labeled rows and columns)
data.shape # Get the table size as (rows, columns)

(384370, 67)

This is a large dataset, so let's explore what it holds.

In [5]:
data.head(10) # Use the 'head' method to view the first n rows (here n = 10)

,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,Unnamed: 66
0,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,,,,,,,...,16.936004,17.337896,17.687093,18.140971,18.491344,18.82552,19.272212,19.628009,,
1,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.RU.ZS,,,,,,,...,6.499471,6.680066,6.85911,7.016238,7.180364,7.322294,7.517191,7.651598,,
2,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.UR.ZS,,,,,,,...,37.855399,38.046781,38.326255,38.468426,38.670044,38.722783,38.927016,39.042839,,
3,Africa Eastern and Southern,AFE,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,...,31.79416,32.001027,33.87191,38.880173,40.261358,43.061877,44.27086,45.803485,,
4,Africa Eastern and Southern,AFE,"Access to electricity, rural (% of rural popul...",EG.ELC.ACCS.RU.ZS,,,,,,,...,18.663502,17.633986,16.464681,24.531436,25.345111,27.449908,29.64176,30.404935,,
5,Africa Eastern and Southern,AFE,"Access to electricity, urban (% of urban popul...",EG.ELC.ACCS.UR.ZS,,,,,,,...,67.112206,66.283426,67.080235,69.132292,70.928567,71.866136,73.332842,73.942949,,
6,Africa Eastern and Southern,AFE,Account ownership at a financial institution o...,FX.OWN.TOTL.ZS,,,,,,,...,,,,,,,,,,
7,Africa Eastern and Southern,AFE,Account ownership at a financial institution o...,FX.OWN.TOTL.FE.ZS,,,,,,,...,,,,,,,,,,
8,Africa Eastern and Southern,AFE,Account ownership at a financial institution o...,FX.OWN.TOTL.MA.ZS,,,,,,,...,,,,,,,,,,
9,Africa Eastern and Southern,AFE,Account ownership at a financial institution o...,FX.OWN.TOTL.OL.ZS,,,,,,,...,,,,,,,,,,


This is a three-dimensional dataset stored in a wide layout, where the dimensions are country, indicator, and year: each row is one country-indicator pair, and each year is a separate column.
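To make the three dimensions explicit, the wide table can be reshaped so that year becomes a column of its own. The following is a minimal sketch on a toy DataFrame, not the WDI file itself: the country names, codes, indicator labels, and values are all made up for illustration, and only the column layout mirrors the `head` output above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the WDI table. All names and values here are invented.
wide = pd.DataFrame({
    "Country Name": ["Aland", "Aland", "Borduria", "Borduria"],
    "Country Code": ["ALD", "ALD", "BRD", "BRD"],
    "Indicator Name": ["Indicator X", "Indicator Y", "Indicator X", "Indicator Y"],
    "Indicator Code": ["IND.X", "IND.Y", "IND.X", "IND.Y"],
    "1960": [1.0, np.nan, 3.0, 4.0],
    "1961": [1.5, 2.5, np.nan, 4.5],
})

id_cols = ["Country Name", "Country Code", "Indicator Name", "Indicator Code"]

# melt() stacks the year columns: one row per (country, indicator, year).
tidy = wide.melt(id_vars=id_cols, var_name="Year", value_name="Value")
tidy["Year"] = tidy["Year"].astype(int)  # year labels arrive as strings

print(tidy.shape)
```

In this long layout, filtering by year is an ordinary row selection, which tends to be convenient when plotting a single indicator over time.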

### How many unique country names are there?

In [7]:
countries = data['Country Name'].unique().tolist() # Use the 'unique' method on the column of the dataframe that contains the country names 
len(countries)

266

### Does the number of country codes match the number of countries?

In [77]:
countryCodes = data['Country Code'].unique().tolist() # The 'unique()' method finds the unique values in the dataframe column
# 'Country Code' and returns an array, then the 'tolist()' method converts that array into a list
len(countryCodes)

266
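Equal counts suggest that names and codes pair up, but they do not prove it: two names could share one code while another name carries two. A stricter check, sketched here on a toy frame with made-up names and codes, is to collect the distinct (name, code) pairs and confirm that neither side repeats.

```python
import pandas as pd

# Toy stand-in for the two label columns. Names and codes are invented.
df = pd.DataFrame({
    "Country Name": ["Aland", "Aland", "Borduria", "Borduria"],
    "Country Code": ["ALD", "ALD", "BRD", "BRD"],
})

# Distinct (name, code) combinations that occur in the table.
pairs = df[["Country Name", "Country Code"]].drop_duplicates()

# The mapping is one-to-one only if no name and no code appears twice.
names_unique = pairs["Country Name"].is_unique
codes_unique = pairs["Country Code"].is_unique

print(len(pairs), names_unique, codes_unique)
```

Applied to `data`, the same three values would be expected to read `266 True True` if the mapping is one-to-one; that result is an expectation, not something verified here.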

### How many indicators are there?

In [15]:
indicators = data['Indicator Name'].unique().tolist()
len(indicators)

1445

In [18]:
# Cross-check with number of indicator codes
indicatorCodes = data['Indicator Code'].unique().tolist()
len(indicatorCodes)

1445
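The counts so far fit together: 266 countries times 1445 indicators equals the 384370 rows reported by `data.shape`, which is consistent with one row per country-indicator pair. The sketch below repeats that arithmetic and then shows, on a toy frame with made-up codes, how the same property could be checked directly.

```python
import pandas as pd

# Counts reported earlier in this notebook.
n_countries, n_indicators, n_rows = 266, 1445, 384370
print(n_countries * n_indicators == n_rows)

# Toy stand-in for the code columns. The codes are invented.
toy = pd.DataFrame({
    "Country Code": ["ALD", "ALD", "BRD", "BRD"],
    "Indicator Code": ["IND.X", "IND.Y", "IND.X", "IND.Y"],
})

# No (country, indicator) pair should occur more than once ...
duplicate_pairs = int(toy.duplicated(subset=["Country Code", "Indicator Code"]).sum())

# ... and the row count should equal countries x indicators (a full grid).
full_grid = len(toy) == toy["Country Code"].nunique() * toy["Indicator Code"].nunique()

print(duplicate_pairs, full_grid)
```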

In [23]:
# List the variables defined in the notebook's namespace so far

%whos

Variable         Type         Data/Info
---------------------------------------
countries        list         n=266
countryCodes     list         n=266
data             DataFrame                           Co<...>384370 rows x 67 columns]
indicatorCodes   list         n=1445
indicators       list         n=1445
np               module       <module 'numpy' from 'C:\<...>ges\\numpy\\__init__.py'>
pd               module       <module 'pandas' from 'C:<...>es\\pandas\\__init__.py'>
plt              module       <module 'matplotlib.pyplo<...>\\matplotlib\\pyplot.py'>
random           module       <module 'random' from 'C:<...>aconda3\\lib\\random.py'>


### How many years of data do we have?

In [83]:
years = data.columns[4:-1] # Grab the column labels from position 4 (the first year column) onward, excluding the trailing 'Unnamed: 66' column
len(years)

62
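Slicing by position works for this file, but it depends on the exact column order. As an alternative sketch, the year columns can be picked out by their labels instead. The `columns` Index below is a shortened, hand-written stand-in for `data.columns`, used only so the example runs on its own.

```python
import pandas as pd

# Shortened stand-in for data.columns (most year labels omitted).
columns = pd.Index([
    "Country Name", "Country Code", "Indicator Name", "Indicator Code",
    "1960", "1961", "2021", "Unnamed: 66",
])

# Keep only labels that look like four-digit years.
year_cols = [c for c in columns if c.isdigit() and len(c) == 4]

print(year_cols)
```

This selection ignores a trailing unnamed column without having to know its position.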

In [85]:
print(years)

Index(['1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021'],
      dtype='object')


In [86]:
# Convert the index into a list
years = years.tolist()

In [88]:
print(years)

['1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021']


### What's the range of years?

In [89]:
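# The labels are four-digit strings, so min/max (which compare strings lexicographically) still order them chronologically
print(min(years), "to", max(years))

1960 to 2021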
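Because the year labels are strings, `min` and `max` compare them character by character. That matches chronological order only while every label has the same number of digits. A small sketch of the safer route, converting to integers first, follows; the `years` list here is a shortened stand-in for the full list above.

```python
# Shortened stand-in for the full list of year labels.
years = ["1960", "1961", "2020", "2021"]

# Convert to integers so comparisons and arithmetic are numeric.
year_numbers = [int(y) for y in years]
first, last = min(year_numbers), max(year_numbers)
span = last - first + 1  # inclusive number of calendar years covered

print(f"{first} to {last} ({span} years)")

# The string pitfall: "1960" sorts before "999" because '1' < '9'.
print(min(["999", "1960"]), min([999, 1960]))
```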
