# Countries of the World

By Krzysztof Satola ([github.com/ksatola](https://github.com/ksatola)).

Based on CRISP-DM (Cross-Industry Standard Process for Data Mining).

## Business Understanding

### Dataset Dictionary

- **country** - country name
- **region** - region name
- **population** - number of people within country
- **area** - area in sq. mi.
- **popdensity** - population density per sq. mi.
- **coast** - coastline (coast/area ratio)
- **netmigr** - net migration rate: the difference between the number of immigrants (people coming into an area) and the number of emigrants (people leaving an area) over the year. A positive rate means that more people are entering than leaving the area; a negative rate means that more people are leaving than entering. When the number of immigrants equals the number of emigrants, the net migration rate is balanced ([source](https://en.wikipedia.org/wiki/Net_migration_rate)).
- **infmortality** - infant mortality (per 1000 births)
- **gdp** - gross domestic product (GDP) in $ per capita. The value of all final goods and services produced within a nation in a given year (2013), converted at market exchange rates to current U.S. dollars, divided by the average population for the same year ([source](https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita)).
- **literacy** - literacy level in %
- **phones** - phones per 1000
- **arable** - percent of arable areas
- **crops** - percent of cropland used to grow food
- **other** - percent of land that is neither arable nor under crops (arable + crops + other = 100%)
- **climate** - climate type
- **birthrate** - the birth rate (technically, births/population rate), the total number of live births per 1,000 in a population in 2013 ([source](https://en.wikipedia.org/wiki/Birth_rate)).
- **deathrate** - number of deaths in units of deaths per 1,000 individuals ([source](https://en.wikipedia.org/wiki/Mortality_rate)).
- **agriculture** - share of GDP contributed by the agriculture sector ([source](https://en.wikipedia.org/wiki/List_of_countries_by_GDP_sector_composition)), stored as a fraction (e.g. 0.05 = 5%). Agriculture + industry + service = 100% of GDP.
- **industry** - share of GDP contributed by the industry sector, stored as a fraction
- **service** - share of GDP contributed by the service sector, stored as a fraction
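
The net migration definition above can be turned into a small worked example. This is only a sketch: the function name `net_migration_rate` and the input figures are hypothetical, and the per-1,000 scaling follows the convention of the cited Wikipedia article. The dataset itself does not state the unit of `netmigr`.

```python
def net_migration_rate(immigrants: int, emigrants: int, population: int) -> float:
    """Net migrants per 1,000 inhabitants (assumed convention, not confirmed by the dataset)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return (immigrants - emigrants) * 1000 / population


# Hypothetical figures for illustration only
positive = net_migration_rate(immigrants=12_000, emigrants=7_000, population=1_000_000)
negative = net_migration_rate(immigrants=3_000, emigrants=9_000, population=1_000_000)
balanced = net_migration_rate(immigrants=5_000, emigrants=5_000, population=1_000_000)

print(positive, negative, balanced)
```

The three calls correspond to the positive, negative, and balanced cases described in the dictionary entry.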

### Objectives

In this project, I explore the [Countries of the World Kaggle dataset](https://www.kaggle.com/fernandol/countries-of-the-world) to answer the following questions:

1. How do different regions compare to each other in terms of area, population, and population density?
2. 
3. What are the most significant predictors of a country's GDP per capita, the key indicator of a country's economic development?

Next, I write a post to ??? about my findings.

## Data Understanding

In [112]:
# Import required libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Random state
rstate = 123

pd.options.display.float_format = '{:20.2f}'.format

In [11]:
%load_ext version_information

In [14]:
# Document versions of used libraries
%version_information numpy, pandas, matplotlib, seaborn

| Software | Version |
|---|---|
| Python | 3.7.1 64bit [Clang 4.0.1 (tags/RELEASE_401/final)] |
| IPython | 7.2.0 |
| OS | Darwin 18.2.0 x86_64 i386 64bit |
| numpy | 1.16.2 |
| pandas | 0.23.4 |
| matplotlib | 3.0.2 |
| seaborn | 0.9.0 |

Tue Mar 19 22:36:10 2019 CET


In [91]:
# Load data from a CSV file
df = pd.read_csv('./data/countries of the world.csv', decimal=',')
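
To see why `decimal=','` matters here, the sketch below parses a tiny inline sample with and without it. The two-row sample and the variable names are made up for illustration; only the `decimal` parameter of `pd.read_csv` comes from the notebook.

```python
import io

import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical two-row sample that mimics comma decimal separators
sample = 'Country,Literacy (%)\nA,"36,0"\nB,"86,5"\n'

raw = pd.read_csv(io.StringIO(sample))                  # commas left as text
parsed = pd.read_csv(io.StringIO(sample), decimal=',')  # commas read as decimal points

print(is_numeric_dtype(raw['Literacy (%)']), is_numeric_dtype(parsed['Literacy (%)']))
```

Without the parameter the column stays textual and would need a separate conversion step before any numeric analysis.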

In [92]:
# Initial look into the dataset
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Country | Afghanistan | Albania | Algeria | American Samoa | Andorra |
| Region | ASIA (EX. NEAR EAST) | EASTERN EUROPE | NORTHERN AFRICA | OCEANIA | WESTERN EUROPE |
| Population | 31056997 | 3581655 | 32930091 | 57794 | 71201 |
| Area (sq. mi.) | 647500 | 28748 | 2381740 | 199 | 468 |
| Pop. Density (per sq. mi.) | 48 | 124.6 | 13.8 | 290.4 | 152.1 |
| Coastline (coast/area ratio) | 0 | 1.26 | 0.04 | 58.29 | 0 |
| Net migration | 23.06 | -4.93 | -0.39 | -20.71 | 6.6 |
| Infant mortality (per 1000 births) | 163.07 | 21.52 | 31 | 9.27 | 4.05 |
| GDP ($ per capita) | 700 | 4500 | 6000 | 8000 | 19000 |
| Literacy (%) | 36 | 86.5 | 70 | 97 | 100 |


In [93]:
df.tail().T

| | 222 | 223 | 224 | 225 | 226 |
|---|---|---|---|---|---|
| Country | West Bank | Western Sahara | Yemen | Zambia | Zimbabwe |
| Region | NEAR EAST | NORTHERN AFRICA | NEAR EAST | SUB-SAHARAN AFRICA | SUB-SAHARAN AFRICA |
| Population | 2460492 | 273008 | 21456188 | 11502010 | 12236805 |
| Area (sq. mi.) | 5860 | 266000 | 527970 | 752614 | 390580 |
| Pop. Density (per sq. mi.) | 419.9 | 1 | 40.6 | 15.3 | 31.3 |
| Coastline (coast/area ratio) | 0 | 0.42 | 0.36 | 0 | 0 |
| Net migration | 2.98 | | 0 | 0 | 0 |
| Infant mortality (per 1000 births) | 19.62 | | 61.5 | 88.29 | 67.69 |
| GDP ($ per capita) | 800 | | 800 | 800 | 1900 |
| Literacy (%) | | | 50.2 | 80.6 | 90.7 |


In [94]:
df.sample(5, random_state=rstate).T

| | 125 | 122 | 156 | 150 | 79 |
|---|---|---|---|---|---|
| Country | Malawi | Macau | Pakistan | Nicaragua | Greece |
| Region | SUB-SAHARAN AFRICA | ASIA (EX. NEAR EAST) | ASIA (EX. NEAR EAST) | LATIN AMER. & CARIB | WESTERN EUROPE |
| Population | 13013926 | 453125 | 165803560 | 5570129 | 10688058 |
| Area (sq. mi.) | 118480 | 28 | 803940 | 129494 | 131940 |
| Pop. Density (per sq. mi.) | 109.8 | 16183 | 206.2 | 43 | 81 |
| Coastline (coast/area ratio) | 0 | 146.43 | 0.13 | 0.7 | 10.37 |
| Net migration | 0 | 4.86 | -2.77 | -1.22 | 2.35 |
| Infant mortality (per 1000 births) | 103.32 | 4.39 | 72.44 | 29.11 | 5.53 |
| GDP ($ per capita) | 600 | 19400 | 2100 | 2300 | 20000 |
| Literacy (%) | 62.7 | 94.5 | 45.7 | 67.5 | 97.5 |


In [95]:
# Dataset size
df.shape

(227, 20)

## Prepare Data

In [96]:
# Variables
df.columns

Index(['Country', 'Region', 'Population', 'Area (sq. mi.)',
       'Pop. Density (per sq. mi.)', 'Coastline (coast/area ratio)',
       'Net migration', 'Infant mortality (per 1000 births)',
       'GDP ($ per capita)', 'Literacy (%)', 'Phones (per 1000)', 'Arable (%)',
       'Crops (%)', 'Other (%)', 'Climate', 'Birthrate', 'Deathrate',
       'Agriculture', 'Industry', 'Service'],
      dtype='object')

In [97]:
# Naming convention, simplify column names and build a dataset dictionary (see above)
df.rename(columns={"Country":"country", 
                  "Region":"region", 
                  "Population":"population", 
                  "Area (sq. mi.)":"area", 
                  "Pop. Density (per sq. mi.)":"popdensity", 
                  "Coastline (coast/area ratio)":"coast", 
                  "Net migration":"netmigr", 
                  "Infant mortality (per 1000 births)":"infmortality", 
                  "GDP ($ per capita)":"gdp", 
                  "Literacy (%)":"literacy", 
                  "Phones (per 1000)":"phones", 
                  "Arable (%)":"arable", 
                  "Crops (%)":"crops", 
                  "Other (%)":"other", 
                  "Climate":"climate", 
                  "Birthrate":"birthrate", 
                  "Deathrate":"deathrate", 
                  "Agriculture":"agriculture", 
                  "Industry":"industry", 
                  "Service":"service"}, inplace=True)

In [98]:
df.columns

Index(['country', 'region', 'population', 'area', 'popdensity', 'coast',
       'netmigr', 'infmortality', 'gdp', 'literacy', 'phones', 'arable',
       'crops', 'other', 'climate', 'birthrate', 'deathrate', 'agriculture',
       'industry', 'service'],
      dtype='object')
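
A possible variation on the renaming step above, shown on a small stand-in frame: keep the mapping in a named dictionary, check that it covers every column, and assign the result instead of using `inplace=True`. The names `COLUMN_MAP` and `demo` are illustrative and not part of the notebook.

```python
import pandas as pd

# Illustrative subset of the mapping used above
COLUMN_MAP = {
    "Country": "country",
    "Area (sq. mi.)": "area",
    "GDP ($ per capita)": "gdp",
}

# Stand-in frame with the original column names (values taken from the Poland row)
demo = pd.DataFrame({
    "Country": ["Poland "],
    "Area (sq. mi.)": [312685],
    "GDP ($ per capita)": [11100.0],
})

# Fail early if a column would be left with its original name
unmapped = set(demo.columns) - set(COLUMN_MAP)
assert not unmapped, f"columns without a new name: {unmapped}"

demo = demo.rename(columns=COLUMN_MAP)
print(list(demo.columns))
```

The coverage check is a design choice for catching silent mismatches, for example when a source column name changes between dataset versions.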

In [99]:
# Example country data
df.iloc[163]

country                                     Poland 
region          EASTERN EUROPE                     
population                                 38536869
area                                         312685
popdensity                                    123.3
coast                                          0.16
netmigr                                       -0.49
infmortality                                   8.51
gdp                                           11100
literacy                                       99.8
phones                                        306.3
arable                                        45.91
crops                                          1.12
other                                         52.97
climate                                           3
birthrate                                      9.85
deathrate                                      9.89
agriculture                                    0.05
industry                                      0.311
service     

In [100]:
# Is there duplicated data in the dataset?
df.duplicated().mean()

0.0

In [101]:
# Country name can be treated as a unique identifier (each name occurs exactly once)
df.country.value_counts().mean()

1.0
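
The Poland row above prints as `Poland ` with a trailing space, which suggests that the raw string columns carry padding. Below is a hedged sketch of stripping that whitespace before checking uniqueness, with `Series.is_unique` as a more direct test. The sample values are stand-ins, not rows loaded from the file.

```python
import pandas as pd

# Stand-in values that mimic the padded strings seen in the notebook output
demo = pd.DataFrame({
    "country": ["Poland ", "Greece ", "Yemen "],
    "region": ["EASTERN EUROPE       ", "WESTERN EUROPE       ", "NEAR EAST       "],
})

# Remove leading/trailing whitespace from both string columns
for col in ["country", "region"]:
    demo[col] = demo[col].str.strip()

# True only if no country name occurs more than once
print(demo["country"].is_unique)
```

Stripping first avoids treating `"Poland"` and `"Poland "` as different countries in later lookups or joins.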

In [102]:
# What are the dataset column data types?
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 227 entries, 0 to 226
Data columns (total 20 columns):
country         227 non-null object
region          227 non-null object
population      227 non-null int64
area            227 non-null int64
popdensity      227 non-null float64
coast           227 non-null float64
netmigr         224 non-null float64
infmortality    224 non-null float64
gdp             226 non-null float64
literacy        209 non-null float64
phones          223 non-null float64
arable          225 non-null float64
crops           225 non-null float64
other           225 non-null float64
climate         205 non-null float64
birthrate       224 non-null float64
deathrate       223 non-null float64
agriculture     212 non-null float64
industry        211 non-null float64
service         212 non-null float64
dtypes: float64(16), int64(2), object(2)
memory usage: 35.5+ KB


In [103]:
# Make the strings categorical
df.country = df.country.astype('category')
df.region = df.region.astype('category')

In [104]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 227 entries, 0 to 226
Data columns (total 20 columns):
country         227 non-null category
region          227 non-null category
population      227 non-null int64
area            227 non-null int64
popdensity      227 non-null float64
coast           227 non-null float64
netmigr         224 non-null float64
infmortality    224 non-null float64
gdp             226 non-null float64
literacy        209 non-null float64
phones          223 non-null float64
arable          225 non-null float64
crops           225 non-null float64
other           225 non-null float64
climate         205 non-null float64
birthrate       224 non-null float64
deathrate       223 non-null float64
agriculture     212 non-null float64
industry        211 non-null float64
service         212 non-null float64
dtypes: category(2), float64(16), int64(2)
memory usage: 44.8 KB
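
The `+` in the earlier `memory usage: 35.5+ KB` line means pandas did not count the memory held by the string objects themselves, so the two `df.info()` figures are not directly comparable. A rough sketch of a like-for-like comparison with `memory_usage(deep=True)` follows; the frame `demo` and its values are synthetic.

```python
import pandas as pd

# Synthetic column: three region labels repeated many times
demo = pd.DataFrame({"region": ["OCEANIA", "NEAR EAST", "BALTICS"] * 1000})

as_text = demo["region"].memory_usage(deep=True)
as_category = demo["region"].astype("category").memory_usage(deep=True)

print(f"text: {as_text} bytes, category: {as_category} bytes")
```

A categorical column tends to save memory when values repeat, as in `region`, and much less so when every value is unique, as in `country`.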


In [105]:
# What regions do we have?
df.region.value_counts()

SUB-SAHARAN AFRICA                     51
LATIN AMER. & CARIB                    45
WESTERN EUROPE                         28
ASIA (EX. NEAR EAST)                   28
OCEANIA                                21
NEAR EAST                              16
EASTERN EUROPE                         12
C.W. OF IND. STATES                    12
NORTHERN AFRICA                         6
NORTHERN AMERICA                        5
BALTICS                                 3
Name: region, dtype: int64

In [106]:
# Median of each numeric variable per region
g = df.groupby("region")
g.median().T

| region | ASIA (EX. NEAR EAST) | BALTICS | C.W. OF IND. STATES | EASTERN EUROPE | LATIN AMER. & CARIB | NEAR EAST | NORTHERN AFRICA | NORTHERN AMERICA | OCEANIA | SUB-SAHARAN AFRICA | WESTERN EUROPE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| population | 26336500.0 | 2274735.0 | 7641217.0 | 6412408.0 | 1065842.0 | 3488139.5 | 21552550.0 | 65773.0 | 114689.0 | 8090068.0 | 4921096.0 |
| area | 208920.0 | 64589.0 | 203050.0 | 67704.0 | 22966.0 | 51825.0 | 724000.0 | 2166086.0 | 811.0 | 245857.0 | 42310.0 |
| popdensity | 192.0 | 35.2 | 56.1 | 102.75 | 91.4 | 87.5 | 38.0 | 29.0 | 60.1 | 39.6 | 167.15 |
| coast | 1.235 | 0.82 | 0.0 | 0.065 | 3.37 | 1.145 | 0.325 | 2.04 | 47.08 | 0.13 | 2.0 |
| netmigr | 0.0 | -2.23 | -2.085 | 0.085 | -1.22 | 0.555 | -0.39 | 2.49 | 0.0 | 0.0 | 2.365 |
| infmortality | 30.775 | 7.87 | 32.425 | 9.33 | 18.05 | 19.06 | 31.0 | 7.54 | 12.62 | 76.83 | 4.705 |
| gdp | 3450.0 | 11400.0 | 3450.0 | 9100.0 | 6300.0 | 9250.0 | 6000.0 | 29800.0 | 5000.0 | 1300.0 | 27200.0 |
| literacy | 90.6 | 99.8 | 99.05 | 98.6 | 94.05 | 83.0 | 70.0 | 97.5 | 95.0 | 62.95 | 99.0 |
| phones | 61.5 | 321.4 | 155.35 | 296.05 | 222.85 | 211.0 | 123.6 | 683.2 | 118.6 | 9.7 | 564.5 |
| arable | 13.595 | 29.67 | 11.135 | 31.755 | 7.6 | 5.305 | 3.045 | 13.04 | 5.71 | 7.58 | 16.91 |
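
The notebook ran on pandas 0.23.4, where non-numeric columns were silently dropped from group medians. On recent pandas releases the same call may raise a `TypeError` because of the categorical `country` column, so a more portable form passes `numeric_only=True`; `observed=True` keeps only the regions that actually occur. A minimal sketch on made-up data (the frame `demo` is hypothetical):

```python
import pandas as pd

# Made-up rows; region is categorical as in the notebook
demo = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "region": pd.Categorical(["X", "X", "Y", "Y", "Y"]),
    "gdp": [100.0, 300.0, 10.0, 20.0, 60.0],
})

# Median of every numeric column per region
medians = demo.groupby("region", observed=True).median(numeric_only=True)
print(medians.T)
```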


In [113]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| population | 227.0 | 28740284.37 | 117891326.54 | 7026.0 | 437624.0 | 4786994.0 | 17497772.5 | 1313973713.0 |
| area | 227.0 | 598226.96 | 1790282.24 | 2.0 | 4647.5 | 86600.0 | 441811.0 | 17075200.0 |
| popdensity | 227.0 | 379.05 | 1660.19 | 0.0 | 29.15 | 78.8 | 190.15 | 16271.5 |
| coast | 227.0 | 21.17 | 72.29 | 0.0 | 0.1 | 0.73 | 10.34 | 870.66 |
| netmigr | 224.0 | 0.04 | 4.89 | -20.99 | -0.93 | 0.0 | 1.0 | 23.06 |
| infmortality | 224.0 | 35.51 | 35.39 | 2.29 | 8.15 | 21.0 | 55.7 | 191.19 |
| gdp | 226.0 | 9689.82 | 10049.14 | 500.0 | 1900.0 | 5550.0 | 15700.0 | 55100.0 |
| literacy | 209.0 | 82.84 | 19.72 | 17.6 | 70.6 | 92.5 | 98.0 | 100.0 |
| phones | 223.0 | 236.06 | 227.99 | 0.2 | 37.8 | 176.2 | 389.65 | 1035.6 |
| arable | 225.0 | 13.8 | 13.04 | 0.0 | 3.22 | 10.42 | 20.0 | 62.11 |


In [79]:
# Missing values
df.isnull().sum()

country          0
region           0
population       0
area             0
popdensity       0
coast            0
netmigr          3
infmortality     3
gdp              1
literacy        18
phones           4
arable           2
crops            2
other            2
climate         22
birthrate        3
deathrate        4
agriculture     15
industry        16
service         15
dtype: int64

In [83]:
# What are the columns with missing values?
df.columns[df.isnull().any()]

Index(['netmigr', 'infmortality', 'gdp', 'literacy', 'phones', 'arable',
       'crops', 'other', 'climate', 'birthrate', 'deathrate', 'agriculture',
       'industry', 'service'],
      dtype='object')
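
Counts alone do not show how much of each column is missing. A small sketch that reports the share of missing values per column, highest first, on a toy frame (the frame `demo` and its values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps
demo = pd.DataFrame({
    "gdp": [700.0, np.nan, 6000.0, 8000.0],
    "climate": [1.0, np.nan, np.nan, 2.0],
    "area": [647500, 28748, 2381740, 199],
})

# Fraction of missing values per column, restricted to columns with any gap
missing_share = demo.isnull().mean()
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)
```

Applied to the full dataset, the same two lines would rank the affected columns by how incomplete they are.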

### Notes

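1. The dataset has 20 variables and 227 observations (one per country).
2. The dataset column names were standardized and their meaning described in the dataset dictionary.
3. There were no duplicated observations in the dataset.
4. The dataset column types were corrected. The quantitative values in the source file used commas instead of periods as decimal separators; this was handled at load time with `decimal=','`. The two string columns (country, region) were converted to the categorical type.
5. The dataset columns with missing values are: netmigr, infmortality, gdp, literacy, phones, arable, crops, other, climate, birthrate, deathrate, agriculture, industry, and service.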

## Data Modeling

## Evaluate the Result

## Deploy