# GDP per capita

- **Data Source**: [World Bank indicator NY.GDP.PCAP.CD](https://data.worldbank.org/indicator/NY.GDP.PCAP.CD)
- **Licence for data**: [CC BY-4.0](https://datacatalog.worldbank.org/public-licenses#cc-by)

## Goals

1. Evaluate the GDP data from the World Bank
2. Prepare it for further use in other projects

## Steps

### Download data - manually

- Open the page [World Bank indicator NY.GDP.PCAP.CD](https://data.worldbank.org/indicator/NY.GDP.PCAP.CD)
- Click [Download CSV](https://api.worldbank.org/v2/en/indicator/NY.GDP.PCAP.CD?downloadformat=csv)
- Save the file to the `_data` subfolder (the filename was `API_NY.GDP.PCAP.CD_DS2_en_csv_v2_133.zip` as of 2024-04-09)
- Unzip it there
- You should get 3 files in the `_data\API_NY.GDP.PCAP.CD_DS2_en_csv_v2_133` folder:
  ```text
  API_NY.GDP.PCAP.CD_DS2_en_csv_v2_133.csv
  Metadata_Country_API_NY.GDP.PCAP.CD_DS2_en_csv_v2_133.csv
  Metadata_Indicator_API_NY.GDP.PCAP.CD_DS2_en_csv_v2_133.csv
  ```
- `API_NY.GDP.PCAP.CD_DS2_en_csv_v2_133.csv` is the file that contains the data


### Download data - Python

- Tutorial: [Real Python: How to Download Files From URLs With Python](https://realpython.com/python-download-file-from-url/)
  - I will use the `requests` version that downloads the file in a [streaming fashion](https://realpython.com/python-download-file-from-url/#downloading-a-large-file-in-a-streaming-fashion)
- Prerequisites: install `requests` (with either `pip` or `conda`)

In [36]:
import requests
from pathlib import Path
import zipfile

In [37]:
url = 'https://api.worldbank.org/v2/en/indicator/NY.GDP.PCAP.CD'
query_parameters = {'downloadformat':'csv'}

In [38]:
response = requests.get(url, params=query_parameters, stream=True)
response.headers

{'Date': 'Tue, 09 Apr 2024 16:04:19 GMT', 'Content-Type': 'application/zip', 'Content-Length': '133072', 'Connection': 'keep-alive', 'set-cookie': '...', ...}

In [39]:
# Fail early if the response is not the expected zip attachment
assert response.ok
assert response.status_code == 200
assert response.headers['Content-Type'] == 'application/zip'
assert response.headers['Connection'] == 'keep-alive'
assert 'content-disposition' in response.headers
# The part before '=' must mention 'filename', the part after it must be non-empty
assert 'filename' in response.headers['content-disposition'].split('=')[0]
assert response.headers['content-disposition'].split('=')[1]



In [40]:
zip_fname = response.headers['content-disposition'].split('=')[1]
zip_fname

'API_NY.GDP.PCAP.CD_DS2_en_csv_v2_133.zip'
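
Splitting the header on `=` works for the value returned here, but it would keep surrounding quotes or trailing parameters if the server ever sent them. The following is a minimal sketch of a more tolerant parser built on the standard library's `email.message.Message`. The helper name `filename_from_content_disposition` and the sample header strings are assumptions made for illustration; they are not values captured from the API.

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    msg = Message()
    msg["Content-Disposition"] = header_value
    # get_filename() reads the `filename` parameter and removes surrounding quotes
    return msg.get_filename()


# Hypothetical header values, shaped like a typical attachment response
plain = filename_from_content_disposition(
    "attachment; filename=API_NY.GDP.PCAP.CD_DS2_en_csv_v2_133.zip"
)
quoted = filename_from_content_disposition('attachment; filename="report.zip"; size=10')
print(plain)
print(quoted)
```

In the notebook this could replace the `split('=')[1]` expression, for example `zip_fname = filename_from_content_disposition(response.headers['content-disposition'])`.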

In [41]:
# Path joins its parts with the OS-specific separator, so no trailing backslash is needed
path_prefix = '_data'
zip_full_name = Path(path_prefix, zip_fname)
zip_full_name


WindowsPath('_data/API_NY.GDP.PCAP.CD_DS2_en_csv_v2_133.zip')

In [42]:
with open(zip_full_name, mode="wb") as file:
    for chunk in response.iter_content(chunk_size=10 * 1024):
        file.write(chunk)
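
The write loop above fails if the `_data` folder does not exist yet. As a sketch, the loop could be wrapped in a small helper that creates the folder first and reports how many bytes were written. The name `save_stream` is hypothetical, and the demo uses in-memory chunks in a temporary folder rather than a real HTTP response.

```python
import tempfile
from pathlib import Path


def save_stream(chunks, target):
    """Write an iterable of byte chunks to `target` and return the byte count."""
    target = Path(target)
    # Create the destination folder (for example `_data`) if it does not exist yet
    target.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(target, mode="wb") as file:
        for chunk in chunks:
            written += file.write(chunk)
    return written


# Demo with in-memory chunks instead of a real HTTP response
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_target = Path(tmp_dir, "_data", "demo.zip")
    demo_size = save_stream([b"abc", b"de"], demo_target)
    demo_content = demo_target.read_bytes()

print(demo_size)  # 5
```

With the notebook's objects the call would be `save_stream(response.iter_content(chunk_size=10 * 1024), zip_full_name)`, and the returned count could be compared with the `Content-Length` header as a sanity check.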

### Unzip file - Python

- See [ZipFile class](https://docs.python.org/3/library/zipfile.html#zipfile.ZipFile.extractall)

In [46]:
# zip_full_name.stem is the filename without the .zip suffix, so it can serve as the folder name
unzip_to = Path(path_prefix, zip_full_name.stem)
unzip_to

WindowsPath('_data/API_NY.GDP.PCAP.CD_DS2_en_csv_v2_133')

In [49]:
file_list = []
with zipfile.ZipFile(zip_full_name, 'r') as zip_ref:
    zip_ref.extractall(unzip_to)
    file_list = zip_ref.filelist

file_list

[<ZipInfo filename='Metadata_Indicator_API_NY.GDP.PCAP.CD_DS2_en_csv_v2_133.csv' compress_type=deflate file_size=612 compress_size=368>,
 <ZipInfo filename='API_NY.GDP.PCAP.CD_DS2_en_csv_v2_133.csv' compress_type=deflate file_size=280252 compress_size=123964>,
 <ZipInfo filename='Metadata_Country_API_NY.GDP.PCAP.CD_DS2_en_csv_v2_133.csv' compress_type=deflate file_size=59443 compress_size=8094>]

In [57]:
# Keep the only CSV that is not a metadata file
data_files = [f for f in file_list if 'Metadata' not in f.filename and Path(f.filename).suffix == '.csv']
assert len(data_files) == 1
data_full_name = Path(unzip_to, data_files[0].filename)
data_full_name


WindowsPath('_data/API_NY.GDP.PCAP.CD_DS2_en_csv_v2_133/API_NY.GDP.PCAP.CD_DS2_en_csv_v2_133.csv')
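
The selection rule used above (a `.csv` member whose name does not contain `Metadata`) can also be applied to the archive's member names before anything is extracted. This is a sketch only: `find_data_member` is a hypothetical helper, and the archive is built in memory with made-up member names.

```python
import io
import zipfile
from pathlib import PurePosixPath


def find_data_member(zip_file):
    """Return the single CSV member whose name does not contain 'Metadata'."""
    candidates = [
        name for name in zip_file.namelist()
        if "Metadata" not in name and PurePosixPath(name).suffix == ".csv"
    ]
    if len(candidates) != 1:
        raise ValueError(f"expected exactly one data file, found {candidates}")
    return candidates[0]


# Build a small in-memory archive with made-up member names and contents
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("Metadata_Indicator_demo.csv", "x")
    archive.writestr("demo.csv", "a,b\n1,2\n")
    archive.writestr("Metadata_Country_demo.csv", "y")

with zipfile.ZipFile(buffer) as archive:
    data_member = find_data_member(archive)

print(data_member)  # demo.csv
```

Raising a `ValueError` with the list of candidates gives a more informative failure than a bare `assert` if the World Bank ever changes the archive layout.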

### Evaluate data

- Now that we have the CSV file, we can open it manually in OpenOffice, Excel, or Google Sheets
- The first 4 rows contain only informational text, so they should be skipped
- The 5th row contains the header
- Each year has its own column, starting at the 5th column (zero-based index **4**)
- Note: When downloading manually, I could instead use the interactive dashboard and choose a table format that stores the year as a value rather than as a column.
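
The layout described in the list above can also be checked without a spreadsheet. The sketch below uses the standard `csv` module on a made-up sample that mimics that layout; the informational rows and the trailing comma at the end of each line are assumptions about the file format (a trailing comma would explain the extra `Unnamed: 68` column that pandas reports).

```python
import csv
import io

# Made-up sample: four leading rows of information, then the header, then data.
sample = (
    '"Data Source","(some text)",\n'
    '\n'
    '"Last Updated Date","(some date)",\n'
    '\n'
    '"Country Name","Country Code","Indicator Name","Indicator Code","1960","1961",\n'
    '"Aruba","ABW","GDP per capita (current US$)","NY.GDP.PCAP.CD","","",\n'
)

rows = list(csv.reader(io.StringIO(sample)))
header = rows[4]  # the 5th row, which pandas uses as header with skiprows=4

# Year columns are the ones whose label consists of digits only
year_columns = [(index, name) for index, name in enumerate(header) if name.isdigit()]
print(year_columns[0])  # (4, '1960')
print(repr(header[-1]))  # '' (the empty label produced by the trailing comma)
```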

### Decision

- Transform the original file into a new one with the following structure:
  
  | COUNTRY_NAME | COUNTRY_CODE | GDP_YEAR | GDP_VALUE |
  |---|---|---|---|
      

In [61]:
import pandas as pd
df_orig = pd.read_csv(data_full_name, skiprows=4)
df_orig

|  | Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | Unnamed: 68 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | GDP per capita (current US$) | NY.GDP.PCAP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 28419.264534 | 28449.712946 | 29329.081747 | 30918.483584 | 31902.809818 | 24008.127822 | 29127.759384 | 33300.838819 | NaN | NaN |
| 1 | Africa Eastern and Southern | AFE | GDP per capita (current US$) | NY.GDP.PCAP.CD | 141.385955 | 144.342434 | 148.774835 | 157.047580 | 166.849791 | 177.769086 | ... | 1554.167299 | 1444.003514 | 1625.286236 | 1558.307482 | 1507.982881 | 1355.805923 | 1545.613215 | 1644.062829 | NaN | NaN |
| 2 | Afghanistan | AFG | GDP per capita (current US$) | NY.GDP.PCAP.CD | 62.369375 | 62.443703 | 60.950364 | 82.021738 | 85.511073 | 105.243196 | ... | 566.881133 | 523.053012 | 526.140801 | 492.090632 | 497.741429 | 512.055098 | 355.777826 | NaN | NaN | NaN |
| 3 | Africa Western and Central | AFW | GDP per capita (current US$) | NY.GDP.PCAP.CD | 107.053706 | 112.128417 | 117.814663 | 122.370114 | 130.700278 | 137.301801 | ... | 1882.264038 | 1648.762676 | 1590.277754 | 1735.374911 | 1812.446822 | 1688.075575 | 1766.943618 | 1785.312219 | NaN | NaN |
| 4 | Angola | AGO | GDP per capita (current US$) | NY.GDP.PCAP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3217.339244 | 1809.709377 | 2439.374441 | 2540.508878 | 2191.347764 | 1450.905112 | 1927.474078 | 3000.444231 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 261 | Kosovo | XKX | GDP per capita (current US$) | NY.GDP.PCAP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3520.782075 | 3759.472855 | 4009.353811 | 4384.188680 | 4416.029253 | 4310.934002 | 5269.783901 | 5340.268798 | NaN | NaN |
| 262 | Yemen, Rep. | YEM | GDP per capita (current US$) | NY.GDP.PCAP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1488.416269 | 1069.816998 | 893.716494 | 701.714869 | 693.816504 | 578.512010 | 543.637538 | 650.272218 | NaN | NaN |
| 263 | South Africa | ZAF | GDP per capita (current US$) | NY.GDP.PCAP.CD | 529.561923 | 543.042224 | 560.699394 | 601.599951 | 642.688431 | 681.131111 | ... | 6204.929901 | 5735.066787 | 6734.475153 | 7067.724165 | 6702.526617 | 5753.066494 | 7073.612754 | 6766.481254 | NaN | NaN |
| 264 | Zambia | ZMB | GDP per capita (current US$) | NY.GDP.PCAP.CD | 228.567399 | 216.274674 | 208.562685 | 209.453362 | 236.941713 | 296.022427 | ... | 1307.909649 | 1249.923143 | 1495.752138 | 1475.199883 | 1268.120941 | 956.831729 | 1134.713454 | 1456.901570 | NaN | NaN |


In [62]:
df_orig.columns

Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022',
       '2023', 'Unnamed: 68'],
      dtype='object')

In [69]:
df_orig[df_orig['1960'].notna()][['Country Name', 'Country Code', '1960']]

|  | Country Name | Country Code | 1960 |
|---|---|---|---|
| 1 | Africa Eastern and Southern | AFE | 141.385955 |
| 2 | Afghanistan | AFG | 62.369375 |
| 3 | Africa Western and Central | AFW | 107.053706 |
| 13 | Australia | AUS | 1810.597443 |
| 14 | Austria | AUT | 935.460427 |
| ... | ... | ... | ... |
| 254 | Venezuela, RB | VEN | 939.560806 |
| 259 | World | WLD | 455.598621 |
| 263 | South Africa | ZAF | 529.561923 |
| 264 | Zambia | ZMB | 228.567399 |


In [81]:
to_be_added = []
for colname in df_orig.columns[4:]:
    try:
        yyyy = int(colname)
    except ValueError:
        # Skip columns whose label is not a year, e.g. 'Unnamed: 68'
        continue
    # Keep only the rows that have a value for this year
    df_year = df_orig.loc[df_orig[colname].notna(), ['Country Name', 'Country Code', colname]]
    # rename() returns a new frame, which avoids modifying a slice of df_orig in place
    df_year = df_year.rename(columns={colname: 'GDP_VALUE'})
    df_year['GDP_YEAR'] = yyyy
    to_be_added.append(df_year)

full_table = pd.concat(to_be_added, ignore_index=True)
full_table = full_table.rename(columns={'Country Name': 'COUNTRY_NAME', 'Country Code': 'COUNTRY_CODE'})
full_table

|  | COUNTRY_NAME | COUNTRY_CODE | GDP_VALUE | GDP_YEAR |
|---|---|---|---|---|
| 0 | Africa Eastern and Southern | AFE | 141.385955 | 1960 |
| 1 | Afghanistan | AFG | 62.369375 | 1960 |
| 2 | Africa Western and Central | AFW | 107.053706 | 1960 |
| 3 | Australia | AUS | 1810.597443 | 1960 |
| 4 | Austria | AUT | 935.460427 | 1960 |
| ... | ... | ... | ... | ... |
| 13197 | Kosovo | XKX | 5340.268798 | 2022 |
| 13198 | Yemen, Rep. | YEM | 650.272218 | 2022 |
| 13199 | South Africa | ZAF | 6766.481254 | 2022 |
| 13200 | Zambia | ZMB | 1456.901570 | 2022 |


### Export it back to CSV

In [82]:
output_fname = 'GDP_PER_CAPITA.csv'
output_full_path = Path(path_prefix, output_fname)
full_table.to_csv(output_full_path, index = False)

## And we are done (or not?) - Alternative solution

Instead of creating one file that repeats `COUNTRY_NAME` on every row, I can create two files:

- `GDP_PER_CAPITA_COUNTRY_CODE.csv`
  
  | COUNTRY_CODE | GDP_YEAR | GDP_VALUE |
  |---|---|---|  
  
  
- `COUNTRY_CODES.csv`
  
  | COUNTRY_CODE | COUNTRY_NAME |
  |---|---|  

In [92]:
df_orig = pd.read_csv(data_full_name, skiprows=4)
df_orig.rename(columns={'Country Name': 'COUNTRY_NAME', 'Country Code': 'COUNTRY_CODE'}, inplace = True)
# df_orig.set_index('COUNTRY_CODE', inplace = True)
df_orig


|  | COUNTRY_NAME | COUNTRY_CODE | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | Unnamed: 68 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | GDP per capita (current US$) | NY.GDP.PCAP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 28419.264534 | 28449.712946 | 29329.081747 | 30918.483584 | 31902.809818 | 24008.127822 | 29127.759384 | 33300.838819 | NaN | NaN |
| 1 | Africa Eastern and Southern | AFE | GDP per capita (current US$) | NY.GDP.PCAP.CD | 141.385955 | 144.342434 | 148.774835 | 157.047580 | 166.849791 | 177.769086 | ... | 1554.167299 | 1444.003514 | 1625.286236 | 1558.307482 | 1507.982881 | 1355.805923 | 1545.613215 | 1644.062829 | NaN | NaN |
| 2 | Afghanistan | AFG | GDP per capita (current US$) | NY.GDP.PCAP.CD | 62.369375 | 62.443703 | 60.950364 | 82.021738 | 85.511073 | 105.243196 | ... | 566.881133 | 523.053012 | 526.140801 | 492.090632 | 497.741429 | 512.055098 | 355.777826 | NaN | NaN | NaN |
| 3 | Africa Western and Central | AFW | GDP per capita (current US$) | NY.GDP.PCAP.CD | 107.053706 | 112.128417 | 117.814663 | 122.370114 | 130.700278 | 137.301801 | ... | 1882.264038 | 1648.762676 | 1590.277754 | 1735.374911 | 1812.446822 | 1688.075575 | 1766.943618 | 1785.312219 | NaN | NaN |
| 4 | Angola | AGO | GDP per capita (current US$) | NY.GDP.PCAP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3217.339244 | 1809.709377 | 2439.374441 | 2540.508878 | 2191.347764 | 1450.905112 | 1927.474078 | 3000.444231 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 261 | Kosovo | XKX | GDP per capita (current US$) | NY.GDP.PCAP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3520.782075 | 3759.472855 | 4009.353811 | 4384.188680 | 4416.029253 | 4310.934002 | 5269.783901 | 5340.268798 | NaN | NaN |
| 262 | Yemen, Rep. | YEM | GDP per capita (current US$) | NY.GDP.PCAP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1488.416269 | 1069.816998 | 893.716494 | 701.714869 | 693.816504 | 578.512010 | 543.637538 | 650.272218 | NaN | NaN |
| 263 | South Africa | ZAF | GDP per capita (current US$) | NY.GDP.PCAP.CD | 529.561923 | 543.042224 | 560.699394 | 601.599951 | 642.688431 | 681.131111 | ... | 6204.929901 | 5735.066787 | 6734.475153 | 7067.724165 | 6702.526617 | 5753.066494 | 7073.612754 | 6766.481254 | NaN | NaN |
| 264 | Zambia | ZMB | GDP per capita (current US$) | NY.GDP.PCAP.CD | 228.567399 | 216.274674 | 208.562685 | 209.453362 | 236.941713 | 296.022427 | ... | 1307.909649 | 1249.923143 | 1495.752138 | 1475.199883 | 1268.120941 | 956.831729 | 1134.713454 | 1456.901570 | NaN | NaN |


In [93]:
output_fname = 'COUNTRY_CODES.csv'
output_full_path = Path(path_prefix, output_fname)
df_orig[['COUNTRY_CODE', 'COUNTRY_NAME']].to_csv(output_full_path, index=False)


In [95]:
cols_to_be_added = []
for colname in df_orig.columns[4:]:
    try:
        int(colname)  # only year columns have a label that parses as an integer
    except ValueError:
        continue
    cols_to_be_added.append(colname)
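
The same filter can be written as a list comprehension. This is a sketch on a shortened, made-up list of labels; it assumes that year labels are plain digit strings, which is slightly stricter than `int()` (for example, `int(' 1960')` succeeds while `' 1960'.isdigit()` is false).

```python
# Shortened list of labels in the same order as df_orig.columns after the rename
columns = ["COUNTRY_NAME", "COUNTRY_CODE", "Indicator Name", "Indicator Code",
           "1960", "1961", "2023", "Unnamed: 68"]

# Skip the four descriptive columns, then keep labels made of digits only
year_columns = [name for name in columns[4:] if name.isdigit()]
print(year_columns)  # ['1960', '1961', '2023']
```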


In [104]:
df_melted = pd.melt(df_orig, id_vars=['COUNTRY_CODE'], value_vars=cols_to_be_added,
                    var_name='GDP_YEAR', value_name='GDP_VALUE')
# Result columns: COUNTRY_CODE | GDP_YEAR | GDP_VALUE
df_melted

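|  | COUNTRY_CODE | GDP_YEAR | GDP_VALUE |
|---|---|---|---|
| 0 | ABW | 1960 | NaN |
| 1 | AFE | 1960 | 141.385955 |
| 2 | AFG | 1960 | 62.369375 |
| 3 | AFW | 1960 | 107.053706 |
| 4 | AGO | 1960 | NaN |
| ... | ... | ... | ... |
| 17019 | XKX | 2023 | NaN |
| 17020 | YEM | 2023 | NaN |
| 17021 | ZAF | 2023 | NaN |
| 17022 | ZMB | 2023 | NaN |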


`df_melted` now contains null values for the years that have no value in the original file. The question is whether to keep these rows or drop them. Here I drop them, which also matches the output of the first approach.

In [105]:
# The ignore_index argument of dropna requires pandas 2.0 or later
df_melted.dropna(inplace=True, ignore_index=True)
df_melted

|  | COUNTRY_CODE | GDP_YEAR | GDP_VALUE |
|---|---|---|---|
| 0 | AFE | 1960 | 141.385955 |
| 1 | AFG | 1960 | 62.369375 |
| 2 | AFW | 1960 | 107.053706 |
| 3 | AUS | 1960 | 1810.597443 |
| 4 | AUT | 1960 | 935.460427 |
| ... | ... | ... | ... |
| 13197 | XKX | 2022 | 5340.268798 |
| 13198 | YEM | 2022 | 650.272218 |
| 13199 | ZAF | 2022 | 6766.481254 |
| 13200 | ZMB | 2022 | 1456.901570 |
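
One difference from the loop-based version is easy to miss: `melt` fills `GDP_YEAR` with the former column labels, which are strings, whereas the loop stored integers. The CSV output looks the same either way, but in memory the type matters for comparisons and joins. A minimal sketch on a toy frame with made-up values:

```python
import pandas as pd

# Toy frame with made-up codes and values, shaped like the renamed df_orig
df_wide = pd.DataFrame({
    "COUNTRY_CODE": ["AAA", "BBB"],
    "1960": [1.0, 2.0],
    "1961": [3.0, 4.0],
})

df_long = pd.melt(df_wide, id_vars=["COUNTRY_CODE"], value_vars=["1960", "1961"],
                  var_name="GDP_YEAR", value_name="GDP_VALUE")

# Right after melt, the years are still text
years_are_text = all(isinstance(year, str) for year in df_long["GDP_YEAR"])

# Cast them to integers so that GDP_YEAR can be compared numerically
df_long["GDP_YEAR"] = df_long["GDP_YEAR"].astype("int64")
print(years_are_text, df_long["GDP_YEAR"].dtype)
```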


In [106]:
output_fname = 'GDP_PER_CAPITA_COUNTRY_CODE.csv'
output_full_path = Path(path_prefix, output_fname)
df_melted.to_csv(output_full_path, index=False)
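
Splitting the data into two files loses nothing, because the country name can be restored with a join on `COUNTRY_CODE`. The sketch below shows this on made-up miniature versions of the two tables; in practice they would be loaded with `pd.read_csv` from the two exported files.

```python
import pandas as pd

# Made-up miniature versions of COUNTRY_CODES.csv and GDP_PER_CAPITA_COUNTRY_CODE.csv
country_codes = pd.DataFrame({
    "COUNTRY_CODE": ["AAA", "BBB"],
    "COUNTRY_NAME": ["Country A", "Country B"],
})
gdp = pd.DataFrame({
    "COUNTRY_CODE": ["AAA", "AAA", "BBB"],
    "GDP_YEAR": [1960, 1961, 1961],
    "GDP_VALUE": [1.0, 2.0, 3.0],
})

# Many-to-one join on the code; validate="m:1" raises an error if a code
# appears more than once in the lookup table
combined = gdp.merge(country_codes, on="COUNTRY_CODE", how="left", validate="m:1")
print(combined)
```

The `validate` argument is a cheap guard here: a duplicated code in the lookup table would otherwise silently duplicate GDP rows.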

## 3rd option

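I can also create a single file that keeps the redundant country names, using the same `melt` method as in the previous section. Note that this overwrites the `GDP_PER_CAPITA.csv` file written by the first approach.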

In [107]:
df_orig = pd.read_csv(data_full_name, skiprows=4)
df_orig.rename(columns={'Country Name': 'COUNTRY_NAME', 'Country Code': 'COUNTRY_CODE'}, inplace = True)

In [108]:
cols_to_be_added = []
for colname in df_orig.columns[4:]:
    try:
        int(colname)  # only year columns have a label that parses as an integer
    except ValueError:
        continue
    cols_to_be_added.append(colname)

In [111]:
df_melted = pd.melt(df_orig, id_vars=['COUNTRY_CODE', 'COUNTRY_NAME'], value_vars=cols_to_be_added, var_name='GDP_YEAR',
                    value_name='GDP_VALUE')
df_melted.dropna(inplace=True, ignore_index=True)
df_melted

|  | COUNTRY_CODE | COUNTRY_NAME | GDP_YEAR | GDP_VALUE |
|---|---|---|---|---|
| 0 | AFE | Africa Eastern and Southern | 1960 | 141.385955 |
| 1 | AFG | Afghanistan | 1960 | 62.369375 |
| 2 | AFW | Africa Western and Central | 1960 | 107.053706 |
| 3 | AUS | Australia | 1960 | 1810.597443 |
| 4 | AUT | Austria | 1960 | 935.460427 |
| ... | ... | ... | ... | ... |
| 13197 | XKX | Kosovo | 2022 | 5340.268798 |
| 13198 | YEM | Yemen, Rep. | 2022 | 650.272218 |
| 13199 | ZAF | South Africa | 2022 | 6766.481254 |
| 13200 | ZMB | Zambia | 2022 | 1456.901570 |


In [112]:
output_fname = 'GDP_PER_CAPITA.csv'
output_full_path = Path(path_prefix, output_fname)
df_melted.to_csv(output_full_path, index=False)
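
To gain some confidence that the loop-based approach and the `melt`-based approach produce the same rows, both can be run on a toy frame and compared. This is a sketch with made-up values, not a check against the real World Bank file; the names `by_loop` and `by_melt` are chosen for illustration only.

```python
import pandas as pd

# Toy frame with made-up values; one missing value to exercise the filtering
df_wide = pd.DataFrame({
    "COUNTRY_NAME": ["Country A", "Country B"],
    "COUNTRY_CODE": ["AAA", "BBB"],
    "1960": [1.0, None],
    "1961": [2.0, 3.0],
})
year_cols = ["1960", "1961"]
out_cols = ["COUNTRY_NAME", "COUNTRY_CODE", "GDP_YEAR", "GDP_VALUE"]

# First approach: one filtered slice per year, concatenated at the end
parts = []
for col in year_cols:
    part = df_wide.loc[df_wide[col].notna(), ["COUNTRY_NAME", "COUNTRY_CODE", col]]
    part = part.rename(columns={col: "GDP_VALUE"})
    part["GDP_YEAR"] = int(col)
    parts.append(part)
by_loop = pd.concat(parts, ignore_index=True)

# Third approach: melt, drop missing values, cast the year labels to integers
by_melt = pd.melt(df_wide, id_vars=["COUNTRY_NAME", "COUNTRY_CODE"],
                  value_vars=year_cols, var_name="GDP_YEAR", value_name="GDP_VALUE")
by_melt = by_melt.dropna(subset=["GDP_VALUE"]).reset_index(drop=True)
by_melt["GDP_YEAR"] = by_melt["GDP_YEAR"].astype("int64")

# Compare with a common column order, since the two approaches order columns differently
same = by_loop[out_cols].equals(by_melt[out_cols])
print(same)
```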