# Countries of the World

By Krzysztof Satola ([github.com/ksatola](https://github.com/ksatola)).

Based on CRISP-DM (Cross-Industry Standard Process for Data Mining).

## Business Understanding

### Dataset Dictionary

- **country** - country name
- **region** - region name
- **population** - number of people within country
- **area** - area in sq. mi.
- **popdensity** - population density per sq. mi.
- **coast** - coastline (coast/area ratio)
- **netmigr** - net migration rate: the difference between the number of immigrants (people coming into an area) and the number of emigrants (people leaving an area) throughout the year. A positive rate means that more people are entering than leaving the area; a negative rate means that more people are leaving than entering. When the number of immigrants equals the number of emigrants, the net migration rate is balanced ([source](https://en.wikipedia.org/wiki/Net_migration_rate)).
- **infmortality** - infant mortality (per 1000 births)
- **gdp** - gross domestic product (GDP) in $ per capita. The value of all final goods and services produced within a nation in a given year (2013), converted at market exchange rates to current U.S. dollars, divided by the average population for the same year ([source](https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita)).
- **literacy** - literacy level in %
- **phones** - phones per 1000
- **arable** - percent of arable areas
- **crops** - percent of cropland used to grow food
- **other** - percent of land that is neither arable nor cropland (in the Poland record below, arable + crops + other = 100)
- **climate** - climate type
- **birthrate** - the birth rate (technically, births/population rate), the total number of live births per 1,000 in a population in 2013 ([source](https://en.wikipedia.org/wiki/Birth_rate)).
- **deathrate** - number of deaths in units of deaths per 1,000 individuals ([source](https://en.wikipedia.org/wiki/Mortality_rate)).
- **agriculture** - percentage of GDP sector composition ratio for agriculture economy sector ([source](https://en.wikipedia.org/wiki/List_of_countries_by_GDP_sector_composition)). Agriculture % + Industry % + Service % = 100% of GDP
- **industry** - percentage of GDP sector composition ratio for industry economy sector
- **service** - percentage of GDP sector composition ratio for service economy sector
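
The net migration definition in the dictionary can be made concrete with a small calculation. This is a minimal sketch, assuming the conventional per-1,000-inhabitants formulation (immigrants minus emigrants, divided by the population, times 1,000). The helper name `net_migration_rate` and all figures are hypothetical and do not come from the dataset.

```python
def net_migration_rate(immigrants: int, emigrants: int, population: int) -> float:
    """Net migration per 1,000 inhabitants (hypothetical helper, assumed formula)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 1000 * (immigrants - emigrants) / population


# Made-up figures, for illustration only
rate_in = net_migration_rate(immigrants=15_000, emigrants=5_000, population=2_000_000)
rate_out = net_migration_rate(immigrants=5_000, emigrants=15_000, population=2_000_000)
rate_even = net_migration_rate(immigrants=5_000, emigrants=5_000, population=2_000_000)

print(rate_in, rate_out, rate_even)
```

A positive result corresponds to more people entering than leaving, a negative one to the opposite, and zero to a balanced rate.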

### Objectives

In this project, I explore the [Countries of the World Kaggle dataset](https://www.kaggle.com/fernandol/countries-of-the-world) to answer the following questions:

1. How do different regions compare to each other in terms of area, population, and population density?
2. 
3. What are the most significant predictors of a country's GDP ($ per capita)?
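
The first question could be approached with a grouped aggregation. The sketch below is one possible way to do it, not necessarily the analysis this project ends up with. It uses a tiny hand-made frame (four rows whose population and area figures are copied from the previews in this notebook) and the simplified column names from the dataset dictionary.

```python
import pandas as pd

# Four rows copied by hand from the notebook previews; not the full dataset
sample = pd.DataFrame({
    "country": ["Afghanistan", "Pakistan", "Zambia", "Zimbabwe"],
    "region": [
        "ASIA (EX. NEAR EAST)",
        "ASIA (EX. NEAR EAST)",
        "SUB-SAHARAN AFRICA",
        "SUB-SAHARAN AFRICA",
    ],
    "population": [31056997, 165803560, 11502010, 12236805],
    "area": [647500, 803940, 752614, 390580],
})

# Sum population and area per region, then derive density from the totals
by_region = sample.groupby("region")[["population", "area"]].sum()
by_region["popdensity"] = by_region["population"] / by_region["area"]

print(by_region.sort_values("popdensity", ascending=False))
```

Density is computed here from regional totals rather than by averaging per-country densities, so large countries weigh more. Either choice is defensible, but they answer slightly different questions.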

Next, I write a post to ??? about my findings.

## Data Understanding

In [25]:
# Import required libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Random state
rstate = 123

In [11]:
%load_ext version_information

In [14]:
# Document versions of used libraries
%version_information numpy, pandas, matplotlib, seaborn

Software,Version
Python,3.7.1 64bit [Clang 4.0.1 (tags/RELEASE_401/final)]
IPython,7.2.0
OS,Darwin 18.2.0 x86_64 i386 64bit
numpy,1.16.2
pandas,0.23.4
matplotlib,3.0.2
seaborn,0.9.0
Tue Mar 19 22:36:10 2019 CET,Tue Mar 19 22:36:10 2019 CET


In [19]:
# Load data from a CSV file
df = pd.read_csv('./data/countries of the world.csv')
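
Many values in this file use a decimal comma (for example `123,3` in the Poland record further down), which is why most numeric columns are loaded as `object`. A possible alternative at load time is the `decimal` parameter of `pd.read_csv`. The sketch below assumes that the decimal-comma fields are quoted in the CSV file (they would have to be for the default comma separator to work); the one-row CSV text is typed in by hand from the Poland record and is not read from the real file.

```python
import io

import pandas as pd

# One row in the assumed layout of the file, values copied from the Poland record
csv_text = (
    "Country,Population,Area (sq. mi.),Pop. Density (per sq. mi.),"
    "Coastline (coast/area ratio)\n"
    'Poland ,38536869,312685,"123,3","0,16"\n'
)

# Default settings: "123,3" stays a string
raw = pd.read_csv(io.StringIO(csv_text))

# decimal="," tells the parser to treat the comma as the decimal separator
parsed = pd.read_csv(io.StringIO(csv_text), decimal=",")

print(raw.dtypes)
print(parsed.dtypes)
```

With `decimal=","`, the density and coastline columns should come back as floats, so no separate type-conversion step would be needed for them.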

In [20]:
# Initial look into the dataset
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Country | Afghanistan | Albania | Algeria | American Samoa | Andorra |
| Region | ASIA (EX. NEAR EAST) | EASTERN EUROPE | NORTHERN AFRICA | OCEANIA | WESTERN EUROPE |
| Population | 31056997 | 3581655 | 32930091 | 57794 | 71201 |
| Area (sq. mi.) | 647500 | 28748 | 2381740 | 199 | 468 |
| Pop. Density (per sq. mi.) | 48,0 | 124,6 | 13,8 | 290,4 | 152,1 |
| Coastline (coast/area ratio) | 0,00 | 1,26 | 0,04 | 58,29 | 0,00 |
| Net migration | 23,06 | -4,93 | -0,39 | -20,71 | 6,6 |
| Infant mortality (per 1000 births) | 163,07 | 21,52 | 31 | 9,27 | 4,05 |
| GDP ($ per capita) | 700 | 4500 | 6000 | 8000 | 19000 |
| Literacy (%) | 36,0 | 86,5 | 70,0 | 97,0 | 100,0 |


In [22]:
df.tail().T

| | 222 | 223 | 224 | 225 | 226 |
|---|---|---|---|---|---|
| Country | West Bank | Western Sahara | Yemen | Zambia | Zimbabwe |
| Region | NEAR EAST | NORTHERN AFRICA | NEAR EAST | SUB-SAHARAN AFRICA | SUB-SAHARAN AFRICA |
| Population | 2460492 | 273008 | 21456188 | 11502010 | 12236805 |
| Area (sq. mi.) | 5860 | 266000 | 527970 | 752614 | 390580 |
| Pop. Density (per sq. mi.) | 419,9 | 1,0 | 40,6 | 15,3 | 31,3 |
| Coastline (coast/area ratio) | 0,00 | 0,42 | 0,36 | 0,00 | 0,00 |
| Net migration | 2,98 | NaN | 0 | 0 | 0 |
| Infant mortality (per 1000 births) | 19,62 | NaN | 61,5 | 88,29 | 67,69 |
| GDP ($ per capita) | 800 | NaN | 800 | 800 | 1900 |
| Literacy (%) | NaN | NaN | 50,2 | 80,6 | 90,7 |


In [26]:
df.sample(5, random_state=rstate).T

| | 125 | 122 | 156 | 150 | 79 |
|---|---|---|---|---|---|
| Country | Malawi | Macau | Pakistan | Nicaragua | Greece |
| Region | SUB-SAHARAN AFRICA | ASIA (EX. NEAR EAST) | ASIA (EX. NEAR EAST) | LATIN AMER. & CARIB | WESTERN EUROPE |
| Population | 13013926 | 453125 | 165803560 | 5570129 | 10688058 |
| Area (sq. mi.) | 118480 | 28 | 803940 | 129494 | 131940 |
| Pop. Density (per sq. mi.) | 109,8 | 16183,0 | 206,2 | 43,0 | 81,0 |
| Coastline (coast/area ratio) | 0,00 | 146,43 | 0,13 | 0,70 | 10,37 |
| Net migration | 0 | 4,86 | -2,77 | -1,22 | 2,35 |
| Infant mortality (per 1000 births) | 103,32 | 4,39 | 72,44 | 29,11 | 5,53 |
| GDP ($ per capita) | 600 | 19400 | 2100 | 2300 | 20000 |
| Literacy (%) | 62,7 | 94,5 | 45,7 | 67,5 | 97,5 |


In [4]:
# Dataset size
df.shape

(227, 20)

In [8]:
# Variables
df.columns

Index(['Country', 'Region', 'Population', 'Area (sq. mi.)',
       'Pop. Density (per sq. mi.)', 'Coastline (coast/area ratio)',
       'Net migration', 'Infant mortality (per 1000 births)',
       'GDP ($ per capita)', 'Literacy (%)', 'Phones (per 1000)', 'Arable (%)',
       'Crops (%)', 'Other (%)', 'Climate', 'Birthrate', 'Deathrate',
       'Agriculture', 'Industry', 'Service'],
      dtype='object')

In [29]:
# Naming convention, simplify column names and build a dataset dictionary (see above)
df.rename(columns={"Country":"country", 
                  "Region":"region", 
                  "Population":"population", 
                  "Area (sq. mi.)":"area", 
                  "Pop. Density (per sq. mi.)":"popdensity", 
                  "Coastline (coast/area ratio)":"coast", 
                  "Net migration":"netmigr", 
                  "Infant mortality (per 1000 births)":"infmortality", 
                  "GDP ($ per capita)":"gdp", 
                  "Literacy (%)":"literacy", 
                  "Phones (per 1000)":"phones", 
                  "Arable (%)":"arable", 
                  "Crops (%)":"crops", 
                  "Other (%)":"other", 
                  "Climate":"climate", 
                  "Birthrate":"birthrate", 
                  "Deathrate":"deathrate", 
                  "Agriculture":"agriculture", 
                  "Industry":"industry", 
                  "Service":"service"}, inplace=True)

In [30]:
df.columns

Index(['country', 'region', 'population', 'area', 'popdensity', 'coast',
       'netmigr', 'infmortality', 'gdp', 'literacy', 'phones', 'arable',
       'crops', 'other', 'climate', 'birthrate', 'deathrate', 'agriculture',
       'industry', 'service'],
      dtype='object')

In [39]:
# Example country record
df.iloc[163]

country                                     Poland 
region          EASTERN EUROPE                     
population                                 38536869
area                                         312685
popdensity                                    123,3
coast                                          0,16
netmigr                                       -0,49
infmortality                                   8,51
gdp                                           11100
literacy                                       99,8
phones                                        306,3
arable                                        45,91
crops                                          1,12
other                                         52,97
climate                                           3
birthrate                                      9,85
deathrate                                      9,89
agriculture                                    0,05
industry                                      0,311
service     
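
The record above shows that the text columns carry padding: the country is stored as `Poland ` with a trailing space and the region is right-padded with spaces. Left as-is, this would make exact lookups such as `df[df.country == "Poland"]` return nothing. A minimal sketch of one way to clean it with `str.strip`, using a one-row frame typed in from the record above (the variable names `padded` and `cleaned` are illustrative):

```python
import pandas as pd

# One row reproducing the padding visible in the Poland record
padded = pd.DataFrame({
    "country": ["Poland "],
    "region": ["EASTERN EUROPE                     "],
})

# Strip leading and trailing whitespace from the text columns
cleaned = padded.copy()
for col in ["country", "region"]:
    cleaned[col] = cleaned[col].str.strip()

# The exact match fails before cleaning and succeeds afterwards
print((padded["country"] == "Poland").sum())
print((cleaned["country"] == "Poland").sum())
```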

In [44]:
# What are the dataset column data types?
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 227 entries, 0 to 226
Data columns (total 20 columns):
country         227 non-null object
region          227 non-null object
population      227 non-null int64
area            227 non-null int64
popdensity      227 non-null object
coast           227 non-null object
netmigr         224 non-null object
infmortality    224 non-null object
gdp             226 non-null float64
literacy        209 non-null object
phones          223 non-null object
arable          225 non-null object
crops           225 non-null object
other           225 non-null object
climate         205 non-null object
birthrate       224 non-null object
deathrate       223 non-null object
agriculture     212 non-null object
industry        211 non-null object
service         212 non-null object
dtypes: float64(1), int64(2), object(17)
memory usage: 35.5+ KB


### Notes

1. The dataset has 20 variables and 227 observations (one per country).
2. The dataset column names were standardized and their meaning described in the dataset dictionary.
3. The dataset column types were corrected.
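4. The dataset columns with missing values are: netmigr, infmortality, gdp, literacy, phones, arable, crops, other, climate, birthrate, deathrate, agriculture, industry, and service (14 of the 20 columns). According to `df.info()`, climate (22 missing), literacy (18), and industry (16) have the most gaps.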
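
Missing-value counts like those reported by `df.info()` can also be put into a small summary table. The sketch below shows one way to do this with `isna().sum()`; the frame `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps, standing in for the real dataset
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "population": [1, 2, 3, 4],
    "gdp": [700.0, np.nan, 800.0, 1900.0],
    "literacy": [np.nan, np.nan, 50.2, 90.7],
})

# Count missing values per column and keep only columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)

# Express the counts as a percentage of all rows
summary = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(toy) * 100).round(1),
})

print(summary)
```

Sorting by the number of gaps puts the columns that most need attention during data preparation at the top.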

## Prepare Data

## Data Modeling??

## Evaluate the Result

## Deploy