# CIA Factbook

The purpose of this project was to explore the CIA Factbook, a database of information about all the countries in the world, using SQL queries to uncover interesting facts.

The project provided the opportunity to develop my knowledge of SQL and to learn Jupyter Notebook. The SQL dialect used is SQLite. The notebook makes use of the following types of query:

* Summary statistics, such as MIN, MAX, COUNT and AVG
* Identifying distinct or repeated records
* Sub-queries
* Floating-point division and casting
* Ordering, limiting and multiple conditions

The original Jupyter notebook can be downloaded from [here](http://pravjey.github.io/CIAFactbook/factbook.ipynb).


In [2]:
%%capture
%load_ext sql
%sql sqlite:///factbook.db

In [3]:
%sql SELECT * FROM sqlite_master WHERE type='table';

 * sqlite:///factbook.db
Done.


type,name,tbl_name,rootpage,sql
table,sqlite_sequence,sqlite_sequence,3,"CREATE TABLE sqlite_sequence(name,seq)"
table,facts,facts,47,"CREATE TABLE ""facts"" (""id"" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, ""code"" varchar(255) NOT NULL, ""name"" varchar(255) NOT NULL, ""area"" integer, ""area_land"" integer, ""area_water"" integer, ""population"" integer, ""population_growth"" float, ""birth_rate"" float, ""death_rate"" float, ""migration_rate"" float)"


## First five rows of CIA Factbook

In [6]:
%sql SELECT * FROM facts LIMIT 5;

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
1,af,Afghanistan,652230,652230,0,32564342,2.32,38.57,13.89,1.51
2,al,Albania,28748,27398,1350,3029278,0.3,12.92,6.58,3.3
3,ag,Algeria,2381741,2381741,0,39542166,1.84,23.67,4.31,0.92
4,an,Andorra,468,468,0,85580,0.12,8.13,6.96,0.0
5,ao,Angola,1246700,1246700,0,19625353,2.78,38.78,11.49,0.46


### The number of records

In [5]:
%%sql 
SELECT COUNT(*) AS "Number of records",
COUNT(DISTINCT(name)) AS "Number of distinct countries"
FROM facts;

Done.


Number of records,Number of distinct countries
261,261


Each country has its own unique record.
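
The check above relies on the difference between `COUNT(*)` and `COUNT(DISTINCT name)`. As a minimal sketch of how a repeated record would show up, the following uses Python's built-in `sqlite3` module with an in-memory toy table; the four rows and the deliberate duplicate are invented for illustration and are not taken from `factbook.db`.

```python
import sqlite3

# Toy stand-in for the facts table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO facts (name) VALUES (?)",
    [("Chad",), ("Cuba",), ("Fiji",), ("Cuba",)],  # "Cuba" repeated on purpose
)

# COUNT(*) counts rows; COUNT(DISTINCT name) counts unique names.
total, distinct = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT name) FROM facts"
).fetchone()

# GROUP BY ... HAVING lists the names that occur more than once.
repeated = conn.execute(
    "SELECT name, COUNT(*) FROM facts GROUP BY name HAVING COUNT(*) > 1"
).fetchall()

print(total, distinct, repeated)
```

Had the two counts differed for the real `facts` table, the same `GROUP BY ... HAVING` pattern would list the repeated names.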

### Summary statistics

The first task was to find the range of the population and population growth figures:

In [6]:
%%sql
SELECT MIN(population) AS "Smallest population",
        MAX(population) AS "Largest population",
        MIN(population_growth) AS "Smallest population growth",
        MAX(population_growth) AS "Largest population growth"
        FROM facts;

Done.


Smallest population,Largest population,Smallest population growth,Largest population growth
0,7256490011,0.0,4.02


Out of the whole database, the lowest population of a country was zero and the highest was over 7 billion. This gave rise to two questions:

* How could a country have a population of zero?
* How could a country have a population that was equivalent to the current population of the whole world?

On further examination, the record with a population of zero turned out to be Antarctica. Since Antarctica has no permanent human inhabitants, only research scientists based there temporarily (and penguins), it made sense to exclude Antarctica from further analysis.

With Antarctica excluded, the country with the lowest population is the Pitcairn Islands.

In [7]:
%%sql
SELECT name from facts 
WHERE population = (SELECT MIN(population) FROM facts);

Done.


name
Antarctica


In [8]:
%%sql
SELECT name from facts 
WHERE population = (SELECT MIN(population) FROM facts
                    WHERE name <> 'Antarctica');

Done.


name
Pitcairn Islands


Similarly, the "country" with the largest population in the CIA Factbook is "the World". Again, it did not make sense to include this record in the analysis of countries. So the country with the largest population is China. 

In [9]:
%%sql
SELECT name from facts 
WHERE population = (SELECT MAX(population) FROM facts);

Done.


name
World


In [10]:
%%sql
SELECT name from facts 
WHERE population = (SELECT MAX(population) FROM facts
                    WHERE name <> 'World');

Done.


name
China


The countries with the lowest growth in population, i.e. zero growth, were Vatican City State, Cocos Islands, Greenland and the Pitcairn Islands.

The country with the highest growth in population was South Sudan. 

In [11]:
%%sql
SELECT name from facts 
WHERE population_growth = (SELECT MIN(population_growth) FROM facts);

Done.


name
Holy See (Vatican City)
Cocos (Keeling) Islands
Greenland
Pitcairn Islands


In [12]:
%%sql
SELECT name from facts 
WHERE population_growth = (SELECT MAX(population_growth) FROM facts);

Done.


name
South Sudan


So, excluding the records for Antarctica and the World, the ranges for population and population growth are as follows:

In [14]:
%%sql
SELECT MIN(population) AS "Smallest population",
        MAX(population) AS "Largest population",
        MIN(population_growth) AS "Smallest population growth",
        MAX(population_growth) AS "Largest population growth"
        FROM facts
        WHERE name <> 'World' AND name <> 'Antarctica';

Done.


Smallest population,Largest population,Smallest population growth,Largest population growth
48,1367485388,0.0,4.02
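
The two exclusions can also be written as a single `NOT IN` condition. The sketch below assumes a five-row in-memory stand-in for the `facts` table (the populations are copied from the results shown above) rather than the real database, and passes the excluded names as bound parameters.

```python
import sqlite3

# In-memory stand-in for the facts table, holding only a handful of rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [
        ("Antarctica", 0),
        ("Pitcairn Islands", 48),
        ("Albania", 3029278),
        ("China", 1367485388),
        ("World", 7256490011),
    ],
)

# Records that are not countries in the usual sense.
excluded = ("World", "Antarctica")

# NOT IN filters out both names in one condition.
low, high = conn.execute(
    "SELECT MIN(population), MAX(population) FROM facts "
    "WHERE name NOT IN (?, ?)",
    excluded,
).fetchone()

print(low, high)
```

Keeping the exclusion list in one place makes it easier to extend later, for example if other aggregate records need to be left out.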


### Average population and area

In [15]:
%sql SELECT AVG(population) FROM facts

Done.


AVG(population)
62094928.32231405


In [16]:
%sql SELECT AVG(area) FROM facts

Done.


AVG(area)
555093.546184739
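
One detail worth keeping in mind is that SQLite's `AVG` skips NULL values, and the schema shown earlier allows `population` and `area` to be NULL. A minimal sketch with an invented three-row table (not the real data) shows the effect:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [("A", 100), ("B", 300), ("C", None)],  # "C" has no recorded population
)

# AVG divides by the number of non-NULL values, i.e. COUNT(population),
# not by the total number of rows, COUNT(*).
avg_pop, rows, non_null = conn.execute(
    "SELECT AVG(population), COUNT(*), COUNT(population) FROM facts"
).fetchone()

print(avg_pop, rows, non_null)
```

Comparing `COUNT(*)` with `COUNT(population)` on the real table would show how many records actually contribute to the average.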


#### Top ten countries with an above-average area and population, in descending order

In [14]:
%%sql
SELECT name, population, area
FROM facts
WHERE population > (SELECT AVG(population) FROM facts) AND
      area > (SELECT AVG(area) FROM facts)
ORDER BY population DESC;

 * sqlite:///factbook.db
Done.


name,population,area
China,1367485388,9596960
India,1251695584,3287263
European Union,513949445,4324782
United States,321368864,9826675
Indonesia,255993674,1904569
Brazil,204259812,8515770
Pakistan,199085847,796095
Nigeria,181562056,923768
Russia,142423773,17098242
Mexico,121736809,1964375


It is worth pointing out that the European Union is included in the list of countries with an above-average area and population. Since the E.U. is not technically a country but a group of countries, it can also be excluded from further analysis. So the top ten countries with above-average population and area are:

In [12]:
%%sql
SELECT name, population, area
FROM facts
WHERE population > (SELECT AVG(population) FROM facts) AND
      area > (SELECT AVG(area) FROM facts) AND
      name <> 'European Union'
ORDER BY population DESC;

 * sqlite:///factbook.db
Done.


name,population,area
China,1367485388,9596960
India,1251695584,3287263
United States,321368864,9826675
Indonesia,255993674,1904569
Brazil,204259812,8515770
Pakistan,199085847,796095
Nigeria,181562056,923768
Russia,142423773,17098242
Mexico,121736809,1964375
Ethiopia,99465819,1104300


### Water to land ratio

#### Top ten countries by water to land ratio

In [23]:
%%sql
SELECT name, (CAST(area_water AS FLOAT) / CAST(area_land AS FLOAT)) 
        AS "Water to land ratio"
FROM facts
ORDER BY (CAST(area_water AS FLOAT) / CAST(area_land AS FLOAT)) DESC
LIMIT 10;

 * sqlite:///factbook.db
Done.


name,Water to land ratio
British Indian Ocean Territory,905.6666666666666
Virgin Islands,4.520231213872832
Puerto Rico,0.5547914317925592
"Bahamas, The",0.3866133866133866
Guinea-Bissau,0.2846728307254623
Malawi,0.2593962585034013
Netherlands,0.2257103236656536
Uganda,0.2229223744292237
Eritrea,0.1643564356435643
Liberia,0.1562396179401993
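
The casts in the query matter because SQLite performs integer division when both operands are integers. The sketch below, which needs no table at all, reuses Albania's water and land areas from the first five rows shown earlier; the final expression with a zero divisor is an invented case added to show what happens for a record with no land area.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

int_ratio, float_ratio, zero_div = conn.execute(
    """
    SELECT 1350 / 27398,                               -- integer division
           CAST(1350 AS FLOAT) / CAST(27398 AS FLOAT), -- floating-point division
           CAST(5 AS FLOAT) / CAST(0 AS FLOAT)         -- division by zero
    """
).fetchone()

print(int_ratio)    # integer division discards the fractional part
print(float_ratio)  # roughly 0.049
print(zero_div)     # SQLite returns NULL (None in Python) instead of raising
```

Casting only one operand would be enough to force floating-point division; casting both, as the notebook does, simply makes the intent explicit.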


Only two entries, British Indian Ocean Territory (also known as the Chagos Islands) and the Virgin Islands, comprise more water than land. The query below confirms this by comparing the water area directly with the land area.

All the other countries in the Factbook consist of more land than water.

In [28]:
%%sql
SELECT name
FROM facts
WHERE area_water > area_land;

Done.


name
British Indian Ocean Territory
Virgin Islands


Eighty-nine countries have a water to land ratio of zero, i.e. no part of their recorded area consists of water features.

In [29]:
%%sql
SELECT COUNT(name)
FROM facts
WHERE (CAST(area_water AS FLOAT) / CAST(area_land AS FLOAT)) = 0.0;

 * sqlite:///factbook.db
Done.


COUNT(name)
89


### Population change

A country's population increases in line with the birth rate and migration rate and decreases in line with the death rate. The figures for birth rate, death rate and migration rate in the database are assumed here to be percentages. (The CIA World Factbook itself publishes these rates per 1,000 population, so the absolute figures below are best read as a ranking rather than as headcounts.) So the query to find the countries expected to see the largest population increase in the coming year is:

In [36]:
%%sql
SELECT name, ((birth_rate - death_rate + migration_rate) / 100) * population
        AS "Population increase"
FROM facts
ORDER BY ((birth_rate - death_rate + migration_rate) / 100) * population DESC
LIMIT 10;

Done.


name,Population increase
India,153583048.1568
China,73844210.952
Nigeria,45317889.1776
Pakistan,35098834.82609999
Indonesia,29464871.8774
Ethiopia,29143484.967000004
Bangladesh,27016343.4255
United States,26352246.848
Philippines,20452171.14
"Congo, Democratic Republic of the",19907284.1088

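As a worked example of the expression used in the query, the sketch below applies the same arithmetic in plain Python to Afghanistan's figures from the first five rows shown earlier. The function name `population_increase` is a hypothetical helper introduced here, and the rates are treated as percentages exactly as the query does.

```python
def population_increase(population, birth_rate, death_rate, migration_rate):
    """Mirror the query's expression, treating the rates as percentages."""
    return ((birth_rate - death_rate + migration_rate) / 100) * population


# Afghanistan: population 32564342, birth 38.57, death 13.89, migration 1.51
afghanistan = population_increase(32564342, 38.57, 13.89, 1.51)

# Net rate is 38.57 - 13.89 + 1.51 = 26.19, so the estimate is about
# 0.2619 * 32564342, i.e. roughly 8.5 million.
print(round(afghanistan))
```

Because the net rate is multiplied by the population, the ranking is dominated by countries that are already large, which is why India and China head the list.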

I was particularly interested to find out which countries are likely to see decreases in population.

In the first instance, these would be countries where the death rate is higher than the birth rate. There are 24 such countries.

In [45]:
%%sql
SELECT COUNT(name)
FROM facts
WHERE death_rate > birth_rate;

Done.


COUNT(name)
24


In [33]:
%%sql
SELECT name
FROM facts
WHERE death_rate > birth_rate;

 * sqlite:///factbook.db
Done.


name
Austria
Belarus
Bosnia and Herzegovina
Bulgaria
Croatia
Czech Republic
Estonia
Germany
Greece
Hungary


The problem with simply comparing death rate and birth rate alone is that birth rate is not the only factor that contributes to population increase. Migration rate also has a role to play. So I then amended the above query to search for the countries where the death rate was higher than the birth and migration rates combined.

There were 13 such countries.

Apart from Japan, they were all in Central and Eastern Europe.

In [47]:
%%sql
SELECT COUNT(name)
FROM facts
WHERE death_rate > (birth_rate + migration_rate)

Done.


COUNT(name)
13


In [50]:
%%sql
SELECT name
FROM facts
WHERE death_rate > (birth_rate + migration_rate);

Done.


name
Belarus
Bosnia and Herzegovina
Bulgaria
Croatia
Germany
Greece
Hungary
Japan
Romania
Russia
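
To illustrate why the two filters give different counts, here is a minimal sketch using three invented countries (Alpha, Beta and Gamma are hypothetical, as are their rates). Alpha has more deaths than births but enough inward migration to offset the difference, so it passes the first filter and not the second.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (name TEXT, birth_rate FLOAT, "
    "death_rate FLOAT, migration_rate FLOAT)"
)
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?)",
    [
        ("Alpha", 9.0, 11.0, 3.0),   # natural decline offset by migration
        ("Beta", 9.0, 11.0, 0.5),    # decline even after migration
        ("Gamma", 20.0, 8.0, 1.0),   # growing
    ],
)

# Filter 1: deaths exceed births.
natural_decline = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM facts WHERE death_rate > birth_rate ORDER BY name"
    )
]

# Filter 2: deaths exceed births and migration combined.
net_decline = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM facts "
        "WHERE death_rate > (birth_rate + migration_rate) ORDER BY name"
    )
]

print(natural_decline, net_decline)
```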


Finally, it would be interesting to see in how many countries the migration rate has a bigger impact on population increase than the birth rate.

In [34]:
%%sql
SELECT name
FROM facts
WHERE birth_rate < migration_rate;

 * sqlite:///factbook.db
Done.


name
Luxembourg
"Micronesia, Federated States of"
Qatar
Singapore
Saint Pierre and Miquelon
British Virgin Islands
Cayman Islands


There are only seven countries in the world where migration rate has a bigger impact on population increase than the birth rate. In all other countries, birth rate has a bigger impact than migration rate.

### Countries with largest population densities

China, India, Russia and the United States may have some of the largest populations, but they also cover large geographic areas, so their populations may be spread out. The more interesting question is which countries have large populations crammed into small geographic areas.

In [39]:
%%sql
SELECT name, (CAST(population AS FLOAT) / CAST(area AS FLOAT)) AS "Population to Area ratio"
FROM facts
ORDER BY (CAST(population AS FLOAT) / CAST(area AS FLOAT)) DESC
LIMIT 10;

Done.


name,Population to Area ratio
Macau,21168.964285714286
Monaco,15267.5
Singapore,8141.279770444763
Hong Kong,6445.041516245487
Gaza Strip,5191.819444444444
Gibraltar,4876.333333333333
Bahrain,1771.8592105263158
Maldives,1319.6409395973155
Malta,1310.01582278481
Bermuda,1299.925925925926


It is interesting to note that the countries with largest population densities are all tiny in a geographic sense.

The average population to area ratio is:

In [7]:
%%sql
SELECT AVG(CAST(population AS FLOAT) / CAST(area AS FLOAT)) AS "Average Population to Area ratio"
FROM facts

 * sqlite:///factbook.db
Done.


Average Population to Area ratio
419.66252469247945


It is interesting that India, despite having one of the largest populations and areas in the world, actually has a below-average population to area ratio.

Most of the countries with below-average population to area ratios appear to be small countries, islands or island archipelagos.

It is interesting to note that the United Kingdom, and the European Union as a whole - despite grumblings about influx of migrants - actually have below-average population to area ratios.

In [15]:
%%sql
SELECT name, (CAST(population AS FLOAT) / CAST(area AS FLOAT)) AS "Population to Area ratio"
FROM facts
WHERE (CAST(population AS FLOAT) / CAST(area AS FLOAT)) < (SELECT AVG(CAST(population AS FLOAT) / CAST(area AS FLOAT))
                                                            FROM facts)
ORDER BY (CAST(population AS FLOAT) / CAST(area AS FLOAT)) DESC;

 * sqlite:///factbook.db
Done.


name,Population to Area ratio
Tuvalu,418.0384615384616
Netherlands,407.9605228317647
Marshall Islands,398.8453038674033
Israel,387.5452094366875
Burundi,385.99626302551206
India,380.7713541630226
Belgium,370.9372707023061
Haiti,364.325009009009
Comoros,349.42774049217
Philippines,336.6612533333333


### Intriguing information

There is one country in the world where the recorded total area is not the same as the sum of the land area and water area.

In [40]:
%%sql
SELECT name
FROM facts
WHERE area <> (area_land + area_water);

Done.


name
"Saint Helena, Ascension, and Tristan da Cunha"

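A caveat on this check: in SQL, a comparison involving NULL is neither true nor false, so records with a missing land or water area are silently left out rather than reported as mismatches. The sketch below uses three invented rows (not taken from the real database) to show the difference between a genuine mismatch and a missing value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (name TEXT, area INTEGER, "
    "area_land INTEGER, area_water INTEGER)"
)
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?)",
    [
        ("Consistent", 100, 90, 10),
        ("Mismatch", 100, 80, 10),
        ("Unknown", 100, None, None),  # components not recorded
    ],
)

# Same condition as the notebook query: only true mismatches are returned.
mismatched = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM facts WHERE area <> (area_land + area_water)"
    )
]

# Rows that could not be checked because a component is NULL.
unchecked = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM facts WHERE area_land IS NULL OR area_water IS NULL"
    )
]

print(mismatched, unchecked)
```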

Names of countries vary in length from 4 to 45 characters, with an average length of 10 (rounded to the nearest whole number).

In [58]:
%%sql
SELECT MAX(LENGTH(name)) AS "Longest name", 
       MIN(LENGTH(name)) AS "Shortest name", 
       ROUND(AVG(LENGTH(name)),0) AS "Average length of name"
FROM facts;

Done.


Longest name,Shortest name,Average length of name
45,4,10.0


The countries with the longest names are:

In [36]:
%%sql
SELECT name
FROM facts
WHERE LENGTH(name) = (SELECT MAX(LENGTH(name))
                      FROM facts);

 * sqlite:///factbook.db
Done.


name
"Saint Helena, Ascension, and Tristan da Cunha"
United States Pacific Island Wildlife Refuges


The countries with the shortest names are:

In [62]:
%%sql
SELECT name
FROM facts
WHERE LENGTH(name) = (SELECT MIN(LENGTH(name))
                      FROM facts);

Done.


name
Chad
Cuba
Fiji
Iran
Iraq
Laos
Mali
Oman
Peru
Togo


There are 84 countries with names longer than the rounded average length and 159 countries with names shorter than it.

In [64]:
%%sql
SELECT COUNT(name)
FROM facts
WHERE LENGTH(name) > (SELECT ROUND(AVG(LENGTH(name)),0)
                      FROM facts);

Done.


COUNT(name)
84


In [65]:
%%sql
SELECT COUNT(name)
FROM facts
WHERE LENGTH(name) < (SELECT ROUND(AVG(LENGTH(name)),0)
                      FROM facts);

Done.


COUNT(name)
159
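
The two counts add up to 243 of the 261 records, so the remaining 18 names must be exactly as long as the rounded average and fall into neither group. As a minimal sketch, the three-way split can be computed in one query; the three-row table below is a toy stand-in (names taken from the results above, chosen so that the average length is exactly 7).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO facts VALUES (?)",
    [("Chad",), ("Algeria",), ("Luxembourg",)],  # lengths 4, 7 and 10
)

# In SQLite a comparison evaluates to 1 or 0, so SUM counts the matches.
above, below, equal = conn.execute(
    """
    WITH avg_len AS (
        SELECT ROUND(AVG(LENGTH(name)), 0) AS value FROM facts
    )
    SELECT SUM(LENGTH(name) > value),
           SUM(LENGTH(name) < value),
           SUM(LENGTH(name) = value)
    FROM facts, avg_len
    """
).fetchone()

print(above, below, equal)
```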


[Back to Portfolio Page](http://pravjey.github.io/archive.html)