# Project: Investigate Gapminder's World Development Indicators

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#questions">Questions</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
<li><a href="#limitation">Limitation</a></li>
</ul>

# Introduction <a id='intro'></a>

**[Gapminder](https://www.gapminder.org/)** has collected extensive information about how people live in different countries, tracked over the years across a number of indicators. The complete datasets, with hundreds of indicators, are available in GitHub repositories, one of which is World Development Indicators (WDI). The data is organized in loose CSV files that any spreadsheet software can read. A more efficient way to consume the data is through the [DDF data model](https://open-numbers.github.io/ddf.html).

DDF is used to define datasets. A dataset is a body of coherent, related data that is composed of separate elements but can be manipulated as one unit by a computer. Each DDF dataset must have Concepts and may have DataPoints, Entities, Metadata, or Synonyms, where:

- **Concepts** contain information about the variables in the dataset, i.e. concept properties with a simple key (concept = the column header in a tabular format).
- **DataPoints** contain multidimensional data, i.e. indicators with composite keys (dimensions).
- **Entities** contain single-dimensional data, i.e. entity properties with simple keys (entity).
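
The difference between simple and composite keys can be illustrated with a small pandas sketch. The values are copied from outputs shown later in this report; representing DDF keys as DataFrame indexes is an illustrative assumption, not something the DDF specification prescribes.

```python
import pandas as pd

# Concepts: one row per variable, identified by the simple key `concept`
concepts = pd.DataFrame(
    {"concept": ["gb_xpd_rsdv_gd_zs"],
     "name": ["Research and development expenditure (% of GDP)"]}
).set_index("concept")

# Entities: one row per entity, identified by a simple key (here `country`)
entities = pd.DataFrame(
    {"country": ["alb"], "name": ["Albania"], "world_4region": ["europe"]}
).set_index("country")

# DataPoints: one row per combination of dimensions, i.e. a composite key
datapoints = pd.DataFrame(
    {"geo": ["alb", "alb"],
     "time": [2007, 2008],
     "gb_xpd_rsdv_gd_zs": [0.08737, 0.15412]}
).set_index(["geo", "time"])

# A datapoint is addressed by all of its dimensions at once
print(datapoints.loc[("alb", 2008), "gb_xpd_rsdv_gd_zs"])
```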

Which datapoints, entities, and concepts are found in which files is defined in the schema and resources sections of the `datapackage.json` file. This file is the entry point for any machine that wants to explore the dataset:

In [1]:
# Import dependencies
import requests
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Set path to data repository
path = 'https://raw.githubusercontent.com/open-numbers/ddf--open_numbers--world_development_indicators/master/'
# Read datapackage.json with data descriptions
datapkg = requests.get(path + 'datapackage.json').json()

# Print key information about a data set
for k, v in datapkg.items():
  if isinstance(v, str) or len(v) < 3:
    print(k, ' :\t', v)
  else:
    print(k, ' : section contains ', len(v), 'elements.')

name  :	 ddf--gapminder--world_development_indicators
language  :	 {'name': 'English', 'id': 'en'}
title  :	 Gapminder's World Development Indicators
description  :	 Gapminder's World Development Indicators
author  :	 Gapminder
license  :	 MIT
created  :	 2021-08-22T12:15:33.269367+00:00
translations  :	 []
version  :	 0.0.1
resources  : section contains  2402 elements.
ddfSchema  : section contains  4 elements.
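
The `resources` section is too large to print, but it can be searched. The sketch below is a hypothetical helper, `resources_by_key`, that lists the files whose primary key matches a given set of dimensions. It assumes each resource carries a `path` and a `schema.primaryKey` entry, as in the Frictionless Data Package layout; the output above does not confirm this, so it is worth checking against the real file. The `sample_pkg` dictionary is a two-element stand-in, not the real content.

```python
def resources_by_key(datapkg, key):
    """Return paths of resources whose primary key equals `key`.

    Assumes each resource looks like
    {'path': ..., 'schema': {'primaryKey': str or list}}.
    """
    paths = []
    for res in datapkg.get("resources", []):
        pk = res.get("schema", {}).get("primaryKey", [])
        # A primary key may be given as a single string or as a list
        if isinstance(pk, str):
            pk = [pk]
        if list(pk) == list(key):
            paths.append(res["path"])
    return paths


# Hypothetical stand-in for the real `resources` section
sample_pkg = {
    "resources": [
        {"path": "ddf--concepts--continuous.csv",
         "schema": {"primaryKey": "concept"}},
        {"path": "datapoints/ddf--datapoints--gb_xpd_rsdv_gd_zs--by--geo--time.csv",
         "schema": {"primaryKey": ["geo", "time"]}},
    ]
}

print(resources_by_key(sample_pkg, ["geo", "time"]))
```

With the real package, the same call would be `resources_by_key(datapkg, ['geo', 'time'])`, provided the assumed layout holds.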


This data analysis investigates two dependent variables (indicators) related to countries:
- **Research and development expenditure (% of GDP)**
- **Total population**

Key descriptions of Research and development expenditure (concept `gb_xpd_rsdv_gd_zs`) and Total population (concept `sp_pop_totl`) are provided by the `ddf--concepts--continuous.csv` file:

In [4]:
import textwrap # To wrap long texts

def get_concept_info(repository, indicator):
  """Prints key information about an indicator retrieved from a repository."""

  print('\nConcept:', indicator)
  print('==========================')
  for v in ['name', 'long_definition', 'statistical_concept_and_methodology', 'source']:
    text = v.replace('_', ' ').capitalize() + ' : ' + str(repository.at[indicator, v])
    for line in textwrap.wrap(text, width=80):
      print(line.replace('\\n', '\n'))
    print()
  
  return None

# Load descriptions of continuous concepts
wdic_cont = pd.read_csv(path + 'ddf--concepts--continuous.csv', index_col='concept')

# Choose indicators for further analysis
indicators = ['gb_xpd_rsdv_gd_zs', 'sp_pop_totl']

# Print key information about indicators
for i in indicators:
  get_concept_info(wdic_cont, i)


Concept: gb_xpd_rsdv_gd_zs
Name : Research and development expenditure (% of GDP)

Long definition : Gross domestic expenditures on research and development (R&D),
expressed as a percent of GDP. They include both capital and current
expenditures in the four main sectors: Business enterprise, Government, Higher
education and Private non-profit. R&D covers basic research, applied research,
and experimental development.

Statistical concept and methodology : The gross domestic expenditure on R&D
indicator consists of the total expenditure (current and capital) on R&D by all
resident companies, research institutes, university and government laboratories,
etc. It excludes R&D expenditures financed by domestic firms but performed
abroad. 

The OECD's Frascati Manual defines research and experimental
development as "creative work undertaken on a systemic basis in order to
increase the stock of knowledge, including knowledge of man, culture and
society, and the use of this stock of knowledge 

# Questions <a id='questions'></a>
According to [this data story](https://www.weforum.org/agenda/2020/11/countries-spending-research-development-gdp/), investment in research and development is the lifeblood of many private sector organizations, helping bring new products and services to market. It is also important to national economies and plays a crucial role in GDP growth. Let us focus on two key questions:

1. Which countries invest the most in R&D?
2. Is there a relationship between R&D expenditure and a country's population?

# Data Wrangling <a id='wrangling'></a>

## Load Data Set
An advantage of using a data repository organized according to the DDF data model is that clean data is easy to load. For example, missing data is avoided by concatenating the datapoints with the 'inner' join method. The steps taken to load and check the data for analysis are described by inline comments in the next cell, and that cell's output provides a log with key information about the data wrangling.
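
As a small illustration of why the 'inner' join avoids missing values, the sketch below concatenates two tiny frames with both join methods. The values are taken from outputs shown later in this report; the frame names `rad` and `pop` are chosen only for this example.

```python
import pandas as pd

# R&D expenditure is available for fewer (geo, time) pairs ...
rad = pd.DataFrame(
    {"gb_xpd_rsdv_gd_zs": [0.08737, 0.15412]},
    index=pd.MultiIndex.from_tuples(
        [("alb", 2007), ("alb", 2008)], names=["geo", "time"]),
)
# ... than total population
pop = pd.DataFrame(
    {"sp_pop_totl": [54208, 2970017, 2947314]},
    index=pd.MultiIndex.from_tuples(
        [("abw", 1960), ("alb", 2007), ("alb", 2008)], names=["geo", "time"]),
)

# 'inner' keeps only keys present in both frames, so no NaN can appear
inner = pd.concat([rad, pop], axis=1, join="inner")
# 'outer' keeps every key and fills the gaps with NaN
outer = pd.concat([rad, pop], axis=1, join="outer")

print(inner)
print(outer.isna().sum())
```

The trade-off is that the 'inner' join silently drops observations that exist for only one indicator, which is why the loading cell `In [5]` logs the shapes before and after concatenation.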

In [5]:
def get_datapoint(path, indicator, dim = 'geo--time'):
  """Reads a datapoint for a given indicator and dimensions.

  Args:
    path : A path to Gapminder World Development Indicator repository.
    indicator : A concept name of an indicator.
    dim : Required dimensions separated by '--'.

  Returns:
    A pandas.DataFrame indexed by dimensions (or prints non-existent path).
  """

  # Compile the path to CSV file
  path = path + 'datapoints/ddf--datapoints--' + indicator + '--by--' + dim + '.csv'
  index_col = dim.split('--')
  # Either read the CSV file or print the incorrect path for checking
  try:
    df = pd.read_csv(path, index_col=index_col)
  except OSError:
    # OSError covers a missing local file as well as HTTP errors for URLs
    print(f'No such file at path\n{path}')
    return None
  print(f'Loading {indicator} ...')
  print('Shape: ', df.shape)
  print('===== Head of a DataFrame ====')
  print(df.head(), '\n==============================\n')
  return df


# Load research and development data
df_rad_exp = get_datapoint(path, 'gb_xpd_rsdv_gd_zs')

# Load total population data
pop_totl = get_datapoint(path, 'sp_pop_totl')

# Concatenate R&D data with total population data where both values are available
df = pd.concat([df_rad_exp, pop_totl], axis=1, join='inner').reset_index()

# Shorten names of indicators
df.rename(columns={'gb_xpd_rsdv_gd_zs': 'rad_exp', 'sp_pop_totl': 'pop_totl'}, inplace=True)

# Load country data
countries = pd.read_csv(path + 'ddf--entities--geo--country.csv', index_col='country')

# Add country data into the dataframe
for col in ['name', 'income_groups', 'world_4region']:
  df[col] = df.geo.apply(lambda c: countries.at[c, col])

# Check duplicates and missing values
assert df.duplicated().sum() == 0
print('No duplicated observations.')
assert not df.isna().any().any()
print('No missing values.')
print('Shape after concatenation and extension :', df.shape)
print('Resulting data set including country data:')
df.head()

Loading gb_xpd_rsdv_gd_zs ...
Shape:  (1527, 1)
===== Head of a DataFrame ====
          gb_xpd_rsdv_gd_zs
geo time                   
alb 2007            0.08737
    2008            0.15412
are 2011            0.48920
    2014            0.70000
arg 1996            0.41749 

Loading sp_pop_totl ...
Shape:  (13195, 1)
===== Head of a DataFrame ====
          sp_pop_totl
geo time             
abw 1960        54208
    1961        55434
    1962        56234
    1963        56699
    1964        57029 

No duplicated observations.
No missing values.
Shape after concatenation and extension : (1527, 7)
Resulting data set including country data:


| | geo | time | rad_exp | pop_totl | name | income_groups | world_4region |
|---|---|---|---|---|---|---|---|
| 0 | alb | 2007 | 0.08737 | 2970017 | Albania | upper_middle_income | europe |
| 1 | alb | 2008 | 0.15412 | 2947314 | Albania | upper_middle_income | europe |
| 2 | are | 2011 | 0.4892 | 8946778 | United Arab Emirates | high_income | asia |
| 3 | are | 2014 | 0.7 | 9214182 | United Arab Emirates | high_income | asia |
| 4 | arg | 1996 | 0.41749 | 35246376 | Argentina | upper_middle_income | americas |
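
The country attributes above were added column by column with `apply`. A possible alternative, sketched here on three rows copied from the table, is a single `DataFrame.join` on the `geo` column. This is offered as an option rather than as the method used in this report; the frame names are chosen for the example only.

```python
import pandas as pd

# Three observations copied from the table above
df = pd.DataFrame({
    "geo": ["alb", "are", "arg"],
    "time": [2007, 2011, 1996],
    "rad_exp": [0.08737, 0.4892, 0.41749],
})

# Country entities indexed by their simple key, as in the loading cell
countries = pd.DataFrame({
    "country": ["alb", "are", "arg"],
    "name": ["Albania", "United Arab Emirates", "Argentina"],
    "income_groups": ["upper_middle_income", "high_income", "upper_middle_income"],
    "world_4region": ["europe", "asia", "americas"],
}).set_index("country")

cols = ["name", "income_groups", "world_4region"]

# Match df['geo'] against the index of `countries` (a left join by default)
joined = df.join(countries[cols], on="geo")
print(joined)
```

Unlike `countries.at[c, col]`, which raises a `KeyError` for an unknown country code, a left join leaves `NaN` in the added columns, so the missing-value check after loading stays important.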
