# Analyzing CIA Factbook data using SQL

The <a href='https://www.cia.gov/the-world-factbook/'>CIA World Factbook</a> contains information about the countries of the world, including brief background history, geography, government, and population. This project analyzes data from the Factbook using SQL. The data dictionary is provided in the table below.

| Field name | Description |
|:-----------|:------------|
| id | Identification number of each country |
| code | Country code |
| name | Country name |
| area | Total area of the country in square kilometres |
| area_land | Land area of the country in square kilometres |
| area_water | Inland water area of the country in square kilometres |
| population | Population of the country (unit: person) |
| population_growth | Population growth rate of the country, in percent |
| birth_rate | Birth rate of the country per year per 1,000 people |
| death_rate | Death rate of the country per year per 1,000 people |
| migration_rate | Net migration rate of the country per year per 1,000 people (positive means more people are entering the country than leaving it) |

## Connecting to the database

In [1]:
%%capture
%load_ext sql
%sql sqlite:///factbook.db

In [2]:
%%sql

/* Overview of tables in our factbook database. */
    
SELECT *
  FROM sqlite_master
 WHERE type='table';

 * sqlite:///factbook.db
Done.


type,name,tbl_name,rootpage,sql
table,sqlite_sequence,sqlite_sequence,3,"CREATE TABLE sqlite_sequence(name,seq)"
table,facts,facts,47,"CREATE TABLE ""facts"" (""id"" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, ""code"" varchar(255) NOT NULL, ""name"" varchar(255) NOT NULL, ""area"" integer, ""area_land"" integer, ""area_water"" integer, ""population"" integer, ""population_growth"" float, ""birth_rate"" float, ""death_rate"" float, ""migration_rate"" float)"
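
Outside a notebook, the same overview can be obtained with Python's built-in `sqlite3` module. The sketch below is only illustrative: it uses an in-memory database with a hypothetical two-column stand-in for `facts`, so the table definition is an assumption. Pointing `sqlite3.connect` at `factbook.db` would query the real file instead.

```python
import sqlite3

# In-memory stand-in for factbook.db (assumption: the real file is not needed here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# sqlite_master holds one row per schema object; keep only the tables.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
```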


In [3]:
%%sql

/* Overview of data in facts table. */

SELECT *
  FROM facts
 LIMIT 5;

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
1,af,Afghanistan,652230,652230,0,32564342,2.32,38.57,13.89,1.51
2,al,Albania,28748,27398,1350,3029278,0.3,12.92,6.58,3.3
3,ag,Algeria,2381741,2381741,0,39542166,1.84,23.67,4.31,0.92
4,an,Andorra,468,468,0,85580,0.12,8.13,6.96,0.0
5,ao,Angola,1246700,1246700,0,19625353,2.78,38.78,11.49,0.46


## Data overview

Prior to the analysis, we will check if there are any outliers and determine whether those outliers should be included in our analysis.

In [4]:
%%sql

SELECT MIN(area) AS min_area,
       MAX(area) AS max_area,
       MIN(area_land) AS min_area_land,
       MAX(area_land) AS max_area_land,
       MIN(population) AS min_pop,
       MAX(population) AS max_pop, 
       MIN(population_growth) AS min_pop_growth, 
       MAX(population_growth) AS max_pop_growth
  FROM facts;

 * sqlite:///factbook.db
Done.


min_area,max_area,min_area_land,max_area_land,min_pop,max_pop,min_pop_growth,max_pop_growth
0,17098242,0,16377742,0,7256490011,0.0,4.02


The statistics above contain 3 values that seem unrealistic:

1. at least one country has a total area (or land area) of zero,
2. at least one country has a population of zero, and
3. at least one country has a population of more than 7 billion people.

A country should have some area and some inhabitants, and 7 billion is roughly the population of the entire world. We'll dig further to identify these rows and check whether the data are valid.
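
Before listing the rows one by one, it can help to count how many rows fall into each suspicious category. The sketch below shows one possible way to do that in a single pass, relying on the fact that SQLite evaluates a comparison to 1 or 0. The four sample rows are a hand-picked stand-in for `facts`, and the `NULL` areas are an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (name TEXT, area INTEGER, area_land INTEGER, population INTEGER)"
)
# Illustrative rows only; None is stored as NULL (assumed missing values).
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?)",
    [
        ("Holy See (Vatican City)", 0, 0, 842),
        ("Antarctica", None, 280000, 0),
        ("World", None, None, 7256490011),
        ("Albania", 28748, 27398, 3029278),
    ],
)

# Each comparison yields 1, 0, or NULL; SUM skips the NULLs.
zero_area, zero_pop, huge_pop = conn.execute(
    """
    SELECT SUM(area = 0 OR area_land = 0),
           SUM(population = 0),
           SUM(population > 7000000000)
      FROM facts
    """
).fetchone()
print(zero_area, zero_pop, huge_pop)
```

Rows whose area is `NULL` are not counted as zero-area here, which is why missing values need a separate `IS NULL` check.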

In [5]:
%%sql

/* Select countries where (land) area is zero or null. */  

SELECT *
  FROM facts
 WHERE area == 0
    OR area IS NULL
    OR area_land == 0
    OR area_land IS NULL;

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
35,cd,Chad,,1259200.0,24800.0,11631456.0,1.89,36.6,14.28,3.45
58,et,Ethiopia,1104300.0,,104300.0,99465819.0,2.89,37.27,8.19,0.22
128,ng,Niger,,1266700.0,300.0,18045729.0,3.25,45.45,12.42,0.56
162,od,South Sudan,644329.0,,,12042910.0,4.02,36.91,8.18,11.47
165,su,Sudan,1861484.0,,,36108853.0,1.72,29.19,7.66,4.29
190,vt,Holy See (Vatican City),0.0,0.0,0.0,842.0,0.0,,,
197,ee,European Union,4324782.0,,,513949445.0,0.25,10.2,10.2,2.5
210,fs,French Southern and Antarctic Lands,,,,,,,,
212,tb,Saint Barthelemy,,,,7237.0,,,,
225,ax,Akrotiri,123.0,,,15700.0,,,,


Based on the table above, the missing data fall into 2 groups: rows missing only the total area, and rows missing some or all of the area-related fields.

| Issue | Countries (examples) | Solution |
|-------|-----------|----------|
| Missing only total area | Chad, Niger, and Antarctica | Calculate the total area from the provided land and water areas. |
| Missing total area and/or land area and/or water area | Ethiopia, South Sudan, Sudan, Akrotiri, and Dhekelia | Check the data on the CIA Factbook website and update the table. |


According to the CIA Factbook website, *Saint Barthelemy* has an area of 25 square kilometres and *Ethiopia* has a land area of 1,096,570 square kilometres. *Sudan*'s land area and water area are also provided on the website. This information will be updated in the table. However, there is no area breakdown for South Sudan, Akrotiri, and Dhekelia, so we will leave them as they are for now.

Vatican City has an area of 0.44 square kilometres, which rounds down to 0 when stored as a whole number.

![saint_barthelemy_area](pics/saint_barthelemy_area.png)
![ethiopia_area](pics/ethiopia_area.png)
![sudan_area](pics/sudan_area.png)
![vatican_area](pics/vatican_area.png)



In [6]:
%%sql

/* Update null value to zero for calculation purpose.*/

UPDATE facts
   SET area = 0
 WHERE area IS NULL;

UPDATE facts
   SET area_land = 0
 WHERE area_land IS NULL;

UPDATE facts
   SET area_water = 0
 WHERE area_water IS NULL;

SELECT *
  FROM facts
 WHERE area == 0
    OR area_land == 0;

 * sqlite:///factbook.db
Done.
15 rows affected.
18 rows affected.
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
35,cd,Chad,0,1259200,24800,11631456.0,1.89,36.6,14.28,3.45
58,et,Ethiopia,1104300,0,104300,99465819.0,2.89,37.27,8.19,0.22
128,ng,Niger,0,1266700,300,18045729.0,3.25,45.45,12.42,0.56
162,od,South Sudan,644329,0,0,12042910.0,4.02,36.91,8.18,11.47
165,su,Sudan,1861484,0,0,36108853.0,1.72,29.19,7.66,4.29
190,vt,Holy See (Vatican City),0,0,0,842.0,0.0,,,
197,ee,European Union,4324782,0,0,513949445.0,0.25,10.2,10.2,2.5
210,fs,French Southern and Antarctic Lands,0,0,0,,,,,
212,tb,Saint Barthelemy,0,0,0,7237.0,,,,
225,ax,Akrotiri,123,0,0,15700.0,,,,
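
One side effect of replacing `NULL` with zero is worth keeping in mind: SQLite's aggregate functions skip `NULL` but do count zero, so averages computed afterwards can shift. The sketch below illustrates this with a hypothetical one-column table `t`; the three values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (area_water INTEGER)")
# Two known values and one missing value (hypothetical data).
conn.executemany("INSERT INTO t VALUES (?)", [(100,), (300,), (None,)])

# AVG ignores the NULL row: (100 + 300) / 2
avg_with_null = conn.execute("SELECT AVG(area_water) FROM t").fetchone()[0]

# After the update the former NULL counts as a real zero: (100 + 300 + 0) / 3
conn.execute("UPDATE t SET area_water = 0 WHERE area_water IS NULL")
avg_with_zero = conn.execute("SELECT AVG(area_water) FROM t").fetchone()[0]

print(avg_with_null, avg_with_zero)
```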


In [7]:
%%sql

/* Update correct data for Saint Barthelemy, Ethiopia, and Sudan. */

UPDATE facts
   SET area = 25,
       area_land = 25
 WHERE name == 'Saint Barthelemy';

UPDATE facts
   SET area_land = 1096570,
       area_water = 7730
 WHERE name == 'Ethiopia';

UPDATE facts
   SET area_land = 1731671,
       area_water = 129813
 WHERE name == 'Sudan';

SELECT *
  FROM facts
 WHERE name IN ('Saint Barthelemy', 'Ethiopia', 'Sudan');

 * sqlite:///factbook.db
Done.
1 rows affected.
1 rows affected.
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
58,et,Ethiopia,1104300,1096570,7730,99465819,2.89,37.27,8.19,0.22
165,su,Sudan,1861484,1731671,129813,36108853,1.72,29.19,7.66,4.29
212,tb,Saint Barthelemy,25,25,0,7237,,,,


In [8]:
%%sql

/* Update the calculation for Chad, Niger, and Antarctica. */

UPDATE facts
   SET area = area_land + area_water
 WHERE name IN ('Chad', 'Niger', 'Antarctica');

SELECT *
  FROM facts
 WHERE name IN ('Chad', 'Niger', 'Antarctica');

 * sqlite:///factbook.db
Done.
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
35,cd,Chad,1284000,1259200,24800,11631456,1.89,36.6,14.28,3.45
128,ng,Niger,1267000,1266700,300,18045729,3.25,45.45,12.42,0.56
250,ay,Antarctica,280000,280000,0,0,,,,
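
As an alternative to updating the table, a missing total area can also be derived at query time. The sketch below assumes the missing totals are still stored as `NULL` and uses `COALESCE` to fall back to land plus water area. The sample rows reuse the figures shown in this notebook; the `area_filled` alias is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (name TEXT, area INTEGER, area_land INTEGER, area_water INTEGER)"
)
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?)",
    [
        ("Chad", None, 1259200, 24800),    # total area missing
        ("Niger", None, 1266700, 300),     # total area missing
        ("Albania", 28748, 27398, 1350),   # total area present
    ],
)

# COALESCE returns its first non-NULL argument, so existing totals are kept.
filled = conn.execute(
    """
    SELECT name, COALESCE(area, area_land + area_water) AS area_filled
      FROM facts
     ORDER BY name
    """
).fetchall()
print(filled)
```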


In [9]:
%%sql

/* Check countries with zero population. */

SELECT *
  FROM facts
 WHERE population == (SELECT MIN(population)
                        FROM facts
                     );

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
250,ay,Antarctica,280000,280000,0,0,,,,


The only entry with zero population is Antarctica. According to the CIA Factbook, it has no indigenous inhabitants. We will keep this row in our further analysis because the data is valid.

![antarctica_pop](pics/antarctica_pop.png)

In [10]:
%%sql

/* Check countries with 7 billion population. */

SELECT *
  FROM facts
 WHERE population == (SELECT MAX(population)
                        FROM facts
                     );

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
261,xx,World,0,0,0,7256490011,1.08,18.6,7.8,


This row should be excluded from our analysis because it describes the whole world rather than a single country.

We also notice from an earlier table that the European Union is included in this database. It will be excluded as well, because its figures duplicate those of the EU member countries already listed in the table.
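
Since the same two rows have to be filtered out of every later query, one option is to define the filter once as a view. The sketch below is a suggestion rather than what this notebook does: the view name `countries` is hypothetical, and the sample table holds only four rows taken from the outputs in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [
        ("China", 1367485388),
        ("India", 1251695584),
        ("European Union", 513949445),
        ("World", 7256490011),
    ],
)

# Hypothetical view that hides the aggregate rows from later queries.
conn.execute(
    """
    CREATE VIEW countries AS
    SELECT *
      FROM facts
     WHERE name NOT IN ('World', 'European Union')
    """
)

# The most populous entry is now a real country, not the World row.
top = conn.execute(
    "SELECT name, population FROM countries ORDER BY population DESC LIMIT 1"
).fetchone()
print(top)
```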

In [11]:
%%sql

/* Recheck the statistics for area and population. */

SELECT MIN(area) AS min_area,
       MIN(population) AS min_pop,
       MAX(population) AS max_pop, 
       MIN(population_growth) AS min_pop_growth, 
       MAX(population_growth) AS max_pop_growth
  FROM facts
 WHERE name <> 'World'
   AND name <> 'European Union';

 * sqlite:///factbook.db
Done.


min_area,min_pop,max_pop,min_pop_growth,max_pop_growth
0,0,1367485388,0.0,4.02


In [12]:
%%sql

/* Check countries with 1 billion population. */

SELECT *
  FROM facts
 WHERE population == (SELECT MAX(population)
                        FROM facts
                       WHERE name <> 'World'
                         AND name <> 'European Union' 
                     );

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
37,ch,China,9596960,9326410,270550,1367485388,0.45,12.49,7.53,0.44


The most populous country is China.

## Finding densely-populated countries (1)

To find densely populated countries, we first look for countries that meet 2 criteria:

- above-average population, and
- below-average total area.

In [13]:
%%sql

SELECT ROUND(AVG(population), 2) AS avg_pop,
       ROUND(AVG(area), 2) AS avg_area
  FROM facts
 WHERE name <> 'World'
   AND name <> 'European Union';

 * sqlite:///factbook.db
Done.


avg_pop,avg_area
30235554.99,527893.96


In [14]:
%%sql

SELECT *, 
       ROUND(population/area_land, 2) AS pop_density
  FROM facts
 WHERE population > (SELECT AVG(population)
                       FROM facts
                      WHERE name <> 'World'
                        AND name <> 'European Union'
                    )
   AND area < (SELECT AVG(area)
                 FROM facts
                WHERE name <> 'World'
                  AND name <> 'European Union'
              )
 ORDER BY (population/area) DESC;

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate,pop_density
14,bg,Bangladesh,148460,130170,18290,168957745,1.6,21.14,5.61,0.46,1297.0
91,ks,"Korea, South",99720,96920,2800,49115196,0.14,8.19,6.75,0.0,506.0
138,rp,Philippines,300000,298170,1830,100998376,1.61,24.27,6.11,2.09,338.0
85,ja,Japan,377915,364485,13430,126919659,0.16,7.93,9.51,0.0,348.0
192,vm,Vietnam,331210,310070,21140,94348835,0.97,15.96,5.93,0.3,304.0
185,uk,United Kingdom,243610,241930,1680,64088222,0.54,12.17,9.35,2.54,264.0
65,gm,Germany,357022,348672,8350,80854408,0.17,8.47,11.42,1.24,231.0
124,np,Nepal,147181,143351,3830,31551305,1.79,20.64,6.56,3.86,220.0
83,it,Italy,301340,294140,7200,61855120,0.27,8.74,10.19,4.1,210.0
182,ug,Uganda,241038,197100,43938,37101745,3.24,43.79,10.69,0.74,188.0


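Using the criteria above and ranking by population per square kilometre of total area, the top 3 countries are Bangladesh, South Korea, and the Philippines. Note that the `pop_density` column divides by land area instead, which is why Japan (348) shows a higher figure than the Philippines (338) despite ranking below it.

Next, we will change our method: we'll calculate the population density per square kilometre for each country and rank the countries by it.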

## Finding densely-populated countries (2)

Since it is not common to live on water, here we will divide the population by only land area of the country.
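
One detail to watch when computing density: `population` and `area_land` are both declared as integers, and SQLite's `/` performs integer division when both operands are integers. That would explain why `pop_density` comes out as a whole number even though it is rounded to 2 decimal places. The sketch below compares the two behaviours using Bangladesh's figures from this notebook; the `CAST` is the same technique used for the water-to-land ratio.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Bangladesh: population 168957745, land area 130170 square kilometres.
int_density, real_density = conn.execute(
    """
    SELECT ROUND(168957745 / 130170, 2),                 -- integer division
           ROUND(CAST(168957745 AS FLOAT) / 130170, 2)   -- floating-point division
    """
).fetchone()
print(int_density, real_density)
```

The fractional part is lost before `ROUND` is applied in the first expression, so casting one operand first is the safer habit when a ratio matters.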

In [15]:
%%sql

SELECT *,
       ROUND(population/area_land, 2) AS pop_density
  FROM facts
 ORDER BY pop_density DESC
 LIMIT 10;

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate,pop_density
205,mc,Macau,28,28,0,592731,0.8,8.88,4.22,3.37,21168.0
117,mn,Monaco,2,2,0,30535,0.12,6.65,9.24,3.83,15267.0
156,sn,Singapore,697,687,10,5674472,1.89,8.27,3.43,14.05,8259.0
204,hk,Hong Kong,1108,1073,35,7141106,0.38,9.23,7.07,1.68,6655.0
251,gz,Gaza Strip,360,360,0,1869055,2.81,31.11,3.04,0.0,5191.0
233,gi,Gibraltar,6,6,0,29258,0.24,14.08,8.37,3.28,4876.0
13,ba,Bahrain,760,760,0,1346613,2.41,13.66,2.69,13.09,1771.0
108,mv,Maldives,298,298,0,393253,0.08,15.75,3.89,12.68,1319.0
110,mt,Malta,316,316,0,413965,0.31,10.18,9.09,1.98,1310.0
227,bd,Bermuda,54,54,0,70196,0.5,11.33,8.23,1.88,1299.0


This method shows that the 3 most densely populated countries are Macau, Monaco, and Singapore.

Comparing the 2 tables, we can see that Bangladesh (the most densely populated country in the earlier table) is not even in the top 10 when every country is ranked by population per square kilometre of land.

## Finding countries with the highest water-to-land ratio

In [16]:
%%sql

SELECT *,
       ROUND(CAST(area_water AS FLOAT)/area_land, 2) AS water_to_land_ratio
  FROM facts
 WHERE name <> 'World'
   AND name <> 'European Union'
 ORDER BY water_to_land_ratio DESC
 LIMIT 10;

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate,water_to_land_ratio
228,io,British Indian Ocean Territory,54400,60,54340,,,,,,905.67
247,vq,Virgin Islands,1910,346,1564,103574.0,0.59,10.31,8.54,7.67,4.52
246,rq,Puerto Rico,13791,8870,4921,3598357.0,0.6,10.86,8.67,8.15,0.55
12,bf,"Bahamas, The",13880,10010,3870,324597.0,0.85,15.5,7.05,0.0,0.39
71,pu,Guinea-Bissau,36125,28120,8005,1726170.0,1.91,33.38,14.33,0.0,0.28
106,mi,Malawi,118484,94080,24404,17964697.0,3.32,41.56,8.41,0.0,0.26
125,nl,Netherlands,41543,33893,7650,16947904.0,0.41,10.83,8.66,1.95,0.23
182,ug,Uganda,241038,197100,43938,37101745.0,3.24,43.79,10.69,0.74,0.22
56,er,Eritrea,117600,101000,16600,6527689.0,2.25,30.0,7.52,0.0,0.16
99,li,Liberia,111369,96320,15049,4195666.0,2.47,34.41,9.69,0.0,0.16


The British Indian Ocean Territory is situated in the Indian Ocean and comprises atolls and small islands. Its closest neighbor is the Maldives, followed by Seychelles.

![BIOT_map](pics/BIOT_map.png)

From the map above, we notice that the Maldives and Seychelles are not among the top 10 countries with the highest water-to-land ratio, despite having geography similar to that of the British Indian Ocean Territory.

In [17]:
%%sql

SELECT *,
       ROUND(CAST(area_water AS FLOAT)/area_land, 2) AS water_to_land_ratio
  FROM facts
 WHERE name == 'Maldives'
    OR name == 'Seychelles';

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate,water_to_land_ratio
108,mv,Maldives,298,298,0,393253,0.08,15.75,3.89,12.68,0.0
154,se,Seychelles,455,455,0,92430,0.83,14.19,6.89,1.0,0.0


It turns out that the CIA Factbook doesn't have water area data for these 2 countries.

![maldives](pics/maldives.png)
![seychelles](pics/seychelles.png)
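
To make the effect of a zero water area concrete, the sketch below recomputes the ratio for three rows using figures from this notebook. It also shows that SQLite returns `NULL` rather than raising an error when the land area is zero, and that `NULL` sorts last under `ORDER BY ... DESC`. The three-row table is only a stand-in for `facts`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, area_land INTEGER, area_water INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("British Indian Ocean Territory", 60, 54340),
        ("Maldives", 298, 0),               # water area recorded as 0
        ("Holy See (Vatican City)", 0, 0),  # land area of 0: division by zero
    ],
)

ratios = conn.execute(
    """
    SELECT name,
           ROUND(CAST(area_water AS FLOAT) / area_land, 2) AS water_to_land_ratio
      FROM facts
     ORDER BY water_to_land_ratio DESC
    """
).fetchall()
for name, ratio in ratios:
    print(name, ratio)
```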

## Finding countries with the largest populations

In [18]:
%%sql

SELECT *,
       ROUND((CAST(population AS FLOAT) / (SELECT SUM(population)
                                             FROM facts
                                            WHERE name <> 'World'
                                              AND name <> 'European Union'
                                          )*100), 2) AS 'world_pop_percent'
  FROM facts
 WHERE name <> 'World'
   AND name <> 'European Union'
 ORDER BY population DESC
 LIMIT 10;

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate,world_pop_percent
37,ch,China,9596960,9326410,270550,1367485388,0.45,12.49,7.53,0.44,18.84
77,in,India,3287263,2973193,314070,1251695584,1.22,19.55,7.32,0.04,17.25
186,us,United States,9826675,9161966,664709,321368864,0.78,12.49,8.15,3.86,4.43
78,id,Indonesia,1904569,1811569,93000,255993674,0.92,16.72,6.37,1.16,3.53
24,br,Brazil,8515770,8358140,157630,204259812,0.77,14.46,6.58,0.14,2.81
132,pk,Pakistan,796095,770875,25220,199085847,1.46,22.58,6.49,1.54,2.74
129,ni,Nigeria,923768,910768,13000,181562056,2.45,37.64,12.9,0.22,2.5
14,bg,Bangladesh,148460,130170,18290,168957745,1.6,21.14,5.61,0.46,2.33
143,rs,Russia,17098242,16377742,720500,142423773,0.04,11.6,13.69,1.69,1.96
85,ja,Japan,377915,364485,13430,126919659,0.16,7.93,9.51,0.0,1.75


China and India are by far the most populous countries. Together they are home to more than 2.6 billion people, or about 36% of the world's population.

## Conclusion

- The 3 most densely populated countries are Macau, Monaco, and Singapore.
- The country with the highest water-to-land ratio is the British Indian Ocean Territory.
- The 3 most populous countries are China, India, and the United States.

However, the data for some countries are incomplete. For example, no water area data are available for the Maldives and Seychelles, which affects the water-to-land ranking.