# Guided Project - Analyzing CIA Factbook Data Using SQL
## <span style='background :yellow' > By MANUEL LA CHICA MALDONADO </span>

In this project, we'll work with data from the [CIA World Factbook](https://www.cia.gov/library/publications/the-world-factbook/), a compendium of statistics about all of the countries on Earth. If you want to work on this project locally, you can download the SQLite factbook.db database [here](https://dsserver-prod-resources-1.s3.amazonaws.com/257/factbook.db).

We'll use SQL in Jupyter Notebook to explore and analyze data from this database.


## Data dictionary

* **code** - The two-letter code for the country.
* **name** - The name of the country.
* **area** - The country's total area (both land and water) in square kilometers.
* **population** - The country's population.
* **population_growth** - The country's population growth as a percentage.
* **birth_rate** - The country's birth rate, or the number of births a year per 1,000 people.
* **death_rate** - The country's death rate, or the number of deaths a year per 1,000 people.
* **area_land** - The country's land area in square kilometers.
* **area_water** - The country's water area in square kilometers.
* **migration_rate** - The country's migration rate.

## Introduction

We'll use the following code to connect our Jupyter Notebook to our database file:

In [1]:
%%capture
%load_ext sql
%sql sqlite:///ProjectsDatasets/factbook.db
# Establishing a connection to the database file that is in our folder ProjectsDatasets

The following query returns information on the tables in the database.

**NOTE: To run SQL queries in Jupyter, we must add `%%sql` on its own line at the start of the cell. This tells Jupyter to treat the cell as an SQL query.**

In [2]:
%%sql
SELECT
 *
FROM 
 sqlite_master
WHERE 
 type='table'
    

 * sqlite:///ProjectsDatasets/factbook.db
Done.


type,name,tbl_name,rootpage,sql
table,sqlite_sequence,sqlite_sequence,3,"CREATE TABLE sqlite_sequence(name,seq)"
table,facts,facts,47,"CREATE TABLE ""facts"" (""id"" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, ""code"" varchar(255) NOT NULL, ""name"" varchar(255) NOT NULL, ""area"" integer, ""area_land"" integer, ""area_water"" integer, ""population"" integer, ""population_growth"" float, ""birth_rate"" float, ""death_rate"" float, ""migration_rate"" float)"


We see that this database contains two tables: *sqlite_sequence* and *facts*. For our project, we are only interested in the second table, which contains facts about countries. Let's have a look at this table.
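The same catalog query can also be run outside the `%%sql` magic with Python's built-in `sqlite3` module. The sketch below is only an illustration: it assumes an in-memory database with a cut-down `facts` table (a few columns only) instead of the real factbook.db file. It also shows where *sqlite_sequence* comes from: SQLite creates it on its own when a table uses `AUTOINCREMENT`.

```python
import sqlite3

# In-memory stand-in for factbook.db (assumption: only a few columns recreated).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "
    "code varchar(255) NOT NULL, "
    "name varchar(255) NOT NULL, "
    "population integer)"
)

# sqlite_master lists every object in the database; keep only the tables.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
```

To point this at the real file, the `":memory:"` argument would be replaced by the path to factbook.db.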

## Overview of the data

Let's see how many rows we have in our dataset.

In [16]:
%%sql
SELECT
 COUNT(*)
FROM
 facts

 * sqlite:///ProjectsDatasets/factbook.db
Done.


COUNT(*)
261


We'll begin by getting a sense of what the data looks like. Let's explore the *facts* table.

In [3]:
%%sql
SELECT
 *
FROM
 facts
LIMIT
 5

 * sqlite:///ProjectsDatasets/factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
1,af,Afghanistan,652230,652230,0,32564342,2.32,38.57,13.89,1.51
2,al,Albania,28748,27398,1350,3029278,0.3,12.92,6.58,3.3
3,ag,Algeria,2381741,2381741,0,39542166,1.84,23.67,4.31,0.92
4,an,Andorra,468,468,0,85580,0.12,8.13,6.96,0.0
5,ao,Angola,1246700,1246700,0,19625353,2.78,38.78,11.49,0.46


Here are the descriptions for some of the columns again:

* **name** - The name of the country.
* **area** - The country's total area (both land and water).
* **area_land** - The country's land area in square kilometers.
* **area_water** - The country's water area in square kilometers.
* **population** - The country's population.
* **population_growth** - The country's population growth as a percentage.
* **birth_rate** - The country's birth rate, or the number of births a year per 1,000 people.
* **death_rate** - The country's death rate, or the number of deaths a year per 1,000 people.

Let's start by calculating some summary statistics and look for any outlier countries.

## Summary Statistics

Let's calculate some summary statistics, such as:

* Minimum population.
* Maximum population.
* Minimum population growth.
* Maximum population growth.

In [4]:
%%sql
SELECT
 MIN(population) AS "MIN population",
 MAX(population) AS "MAX population",
 MIN(population_growth) AS "MIN population_growth",
 MAX(population_growth) AS "MAX population_growth"
FROM
 facts


 * sqlite:///ProjectsDatasets/factbook.db
Done.


MIN population,MAX population,MIN population_growth,MAX population_growth
0,7256490011,0.0,4.02


A few things stick out from the summary statistics above:

* There's a country with a population of 0.

* There's a country with a population of 7256490011 (or more than 7.2 billion people).

Let's use subqueries to zoom in on just these outliers without hard-coding the specific values.

## Exploring Outliers

In [5]:
%%sql
SELECT
 *
FROM
 facts
WHERE
 population = (SELECT MIN(population) FROM facts)

 * sqlite:///ProjectsDatasets/factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
250,ay,Antarctica,,280000,,0,,,,


It seems like the table contains a row for Antarctica, which explains the population of 0. This seems to match the CIA Factbook [page for Antarctica](https://www.cia.gov/library/publications/the-world-factbook/geos/ay.html).

In [6]:
%%sql
SELECT
 *
FROM
 facts
WHERE
 population = (SELECT MAX(population) FROM facts)

 * sqlite:///ProjectsDatasets/factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
261,xx,World,,,,7256490011,1.08,18.6,7.8,


We also see that the table contains a row for the whole world, which explains the maximum population of over 7.2 billion we found earlier.



**Now that we know this, we should recalculate the summary statistics we calculated earlier, while excluding the row for the whole world and Antarctica.**


## Summary Statistics Revisited

In [7]:
%%sql
SELECT
 MIN(population) AS "MIN population",
 MAX(population) AS "MAX population",
 MIN(population_growth) AS "MIN population_growth",
 MAX(population_growth) AS "MAX population_growth"
FROM
 facts
WHERE
 name != 'World' and name != 'Antarctica'

 * sqlite:///ProjectsDatasets/factbook.db
Done.


MIN population,MAX population,MIN population_growth,MAX population_growth
48,1367485388,0.0,4.02


**There's a country whose population is close to 1.4 billion, which is probably China. There's another with a population of 48. Could that be the Vatican?**

In [8]:
%%sql
SELECT
 * 
FROM
 facts
WHERE
 name LIKE '%vatican%' or name LIKE '%china%'

 * sqlite:///ProjectsDatasets/factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
37,ch,China,9596960,9326410,270550,1367485388,0.45,12.49,7.53,0.44
190,vt,Holy See (Vatican City),0,0,0,842,0.0,,,


**The Vatican is not the country with 48 people. Let's find out which it is!**

In [9]:
%%sql
SELECT
 *
FROM
 facts
WHERE 
 population = 48

 * sqlite:///ProjectsDatasets/factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
238,pc,Pitcairn Islands,47,47,0,48,0.0,,,


## Exploring Average Population and Area

Let's look at the average values for two columns: **population** and **area**.

We need to exclude the row for the whole world so that it doesn't skew the averages.

In [10]:
%%sql
SELECT
 ROUND(AVG(population), 2) AS "Average population",
 ROUND(AVG(area), 2) AS "Average area"
FROM
 facts
WHERE
 name != 'World'

 * sqlite:///ProjectsDatasets/factbook.db
Done.


Average population,Average area
32242666.57,555093.55


We see that the average population is around 32 million and the average area is 555 thousand square kilometers.
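One detail worth keeping in mind when reading these averages: `AVG()` skips rows where the column is NULL, so the divisor is the number of non-NULL values, not the number of rows. The sketch below illustrates this with Python's `sqlite3` module on a tiny in-memory table. The table is an assumption for illustration only; it holds three rows taken from this notebook's outputs, not the full dataset.

```python
import sqlite3

# Tiny in-memory stand-in for the facts table (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [
        ("Andorra", 85580),
        ("Pitcairn Islands", 48),
        ("British Indian Ocean Territory", None),  # population is missing
    ],
)

# COUNT(*) counts rows; COUNT(population) counts non-NULL values only.
avg_pop, row_count, non_null_count = conn.execute(
    "SELECT AVG(population), COUNT(*), COUNT(population) FROM facts"
).fetchone()
print(avg_pop, row_count, non_null_count)
```

Here the average is computed over two values even though the table has three rows.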

## Finding Densely Populated Countries 


Let's explore density. Density depends on the population and the country's area. We'll build on the query from the previous section to find countries that are densely populated. We'll identify countries that have:

* Above average values for population.

* Below average values for area.

In [13]:
%%sql
SELECT
 name,
 area,
 population,
 ROUND(CAST(population AS Float) / area, 2) AS population_density
FROM
 facts
WHERE
 population > (SELECT AVG(population) FROM facts WHERE name != 'World') 
 AND area < (SELECT AVG(area) FROM facts WHERE name != 'World')
ORDER BY
 population_density DESC

 * sqlite:///ProjectsDatasets/factbook.db
Done.


name,area,population,population_density
Bangladesh,148460,168957745,1138.07
"Korea, South",99720,49115196,492.53
Philippines,300000,100998376,336.66
Japan,377915,126919659,335.84
Vietnam,331210,94348835,284.86
United Kingdom,243610,64088222,263.08
Germany,357022,80854408,226.47
Italy,301340,61855120,205.27
Uganda,241038,37101745,153.92
Thailand,513120,67976405,132.48


Some of these countries are generally known to be densely populated, so we have confidence in our results!
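The `CAST(population AS Float)` in the density query matters because both columns are integers, and SQLite truncates the result of dividing one integer by another. A minimal sketch with Python's `sqlite3` module, assuming a one-row in-memory table that reuses the Bangladesh figures from the output above:

```python
import sqlite3

# One-row in-memory stand-in for the facts table (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, area INTEGER, population INTEGER)")
conn.execute("INSERT INTO facts VALUES ('Bangladesh', 148460, 168957745)")

# Without the cast the division is integer division and drops the fraction.
int_density, float_density = conn.execute(
    "SELECT population / area, "
    "ROUND(CAST(population AS FLOAT) / area, 2) "
    "FROM facts"
).fetchone()
print(int_density, float_density)
```

The first value is a whole number, while the second keeps two decimal places and matches the `population_density` column in the results.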

## Countries with the most / least

In this section we'll answer questions like:

* Which country has the most people?
* Which country has the fewest people?
* Which 5 countries have the highest population growth?
* Which 5 countries have the highest death-to-birth ratio?
* Which 5 countries have the highest birth-to-death ratio?
* Which countries have more water than land (the highest ratio of water to land)?

In [18]:
%%sql
SELECT
 name,
 MAX(population) population
FROM
 facts
WHERE
 name != 'World'

 * sqlite:///ProjectsDatasets/factbook.db
Done.


name,population
China,1367485388


**China is the country with the most people.**
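The query above selects `name` next to `MAX(population)` without a `GROUP BY`. SQLite accepts this and takes `name` from the row that holds the maximum, but many other SQL engines reject such a query. A more portable way to get the same answer is to sort and keep the first row. The sketch below uses Python's `sqlite3` module and assumes a small in-memory table with three rows copied from this notebook's outputs:

```python
import sqlite3

# Small in-memory stand-in for the facts table (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [
        ("China", 1367485388),
        ("Bangladesh", 168957745),
        ("World", 7256490011),
    ],
)

# Sort by population and keep only the top row, excluding the World aggregate.
most_populous = conn.execute(
    "SELECT name, population FROM facts "
    "WHERE name != 'World' "
    "ORDER BY population DESC LIMIT 1"
).fetchone()
print(most_populous)
```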

In [22]:
%%sql
SELECT
 *
FROM
 facts
WHERE
 population = (SELECT MIN(population) FROM facts WHERE name != 'Antarctica')

 * sqlite:///ProjectsDatasets/factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
238,pc,Pitcairn Islands,47,47,0,48,0.0,,,


With the exclusion of Antarctica, **the Pitcairn Islands have the fewest people, with just 48.**

In [25]:
%%sql
SELECT
 name,
 population_growth
FROM
 facts
ORDER BY
 2 DESC
LIMIT
 5

 * sqlite:///ProjectsDatasets/factbook.db
Done.


name,population_growth
South Sudan,4.02
Malawi,3.32
Burundi,3.28
Niger,3.25
Uganda,3.24


**These are the top 5 countries with the highest population growth rate.**

In [28]:
%%sql 
SELECT
 name,
 death_rate,
 birth_rate,
 ROUND(death_rate / birth_rate, 2) ratio 
FROM
 facts
ORDER BY 
 ratio DESC
LIMIT
 5
 

 * sqlite:///ProjectsDatasets/factbook.db
Done.


name,death_rate,birth_rate,ratio
Bulgaria,14.44,8.92,1.62
Serbia,13.66,9.08,1.5
Latvia,14.31,10.0,1.43
Lithuania,14.27,10.1,1.41
Hungary,12.73,9.16,1.39


**These are the top 5 countries with the highest death-to-birth ratio. In Bulgaria, there are 1.62 deaths for each birth.**
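Some rows in the table have no birth or death rate, so their ratio is NULL. In SQLite, NULL sorts as smaller than any other value, which means `ORDER BY ratio DESC` pushes those rows to the end and the `LIMIT 5` never picks them up. The sketch below shows this with Python's `sqlite3` module, assuming a three-row in-memory table built from values in this notebook's outputs:

```python
import sqlite3

# Small in-memory stand-in for the facts table (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, birth_rate FLOAT, death_rate FLOAT)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("Pitcairn Islands", None, None),  # rates are missing
        ("Serbia", 9.08, 13.66),
        ("Bulgaria", 8.92, 14.44),
    ],
)

# Any arithmetic with NULL yields NULL, and NULL sorts last in DESC order.
ratios = conn.execute(
    "SELECT name, ROUND(death_rate / birth_rate, 2) AS ratio "
    "FROM facts ORDER BY ratio DESC"
).fetchall()
print(ratios)
```

Adding `WHERE birth_rate IS NOT NULL AND death_rate IS NOT NULL` would drop the incomplete rows explicitly instead of relying on sort order.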

In [31]:
%%sql 
SELECT
 name,
 birth_rate,
 death_rate,
 ROUND(birth_rate / death_rate, 2) ratio 
FROM
 facts
ORDER BY 
 ratio DESC
LIMIT
 5

 * sqlite:///ProjectsDatasets/factbook.db
Done.


name,birth_rate,death_rate,ratio
Gaza Strip,31.11,3.04,10.23
Kuwait,19.91,2.18,9.13
Iraq,31.45,3.77,8.34
United Arab Emirates,15.43,1.97,7.83
Oman,24.44,3.36,7.27


**These are the top 5 countries with the highest birth-to-death ratio. In the Gaza Strip, there are 10.23 births for each death.**

In [32]:
%%sql
SELECT
 *
FROM
 facts
WHERE
 area_water > area_land

 * sqlite:///ProjectsDatasets/factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
228,io,British Indian Ocean Territory,54400,60,54340,,,,,
247,vq,Virgin Islands,1910,346,1564,103574.0,0.59,10.31,8.54,7.67


**These are the two countries or territories with more water than land.**
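To rank these entries by how much water they have relative to land, the same filter can be combined with a computed ratio. This is a sketch rather than part of the original analysis: it uses Python's `sqlite3` module, the column alias `water_to_land` is a name chosen here for illustration, and the in-memory table holds only three rows copied from this notebook's outputs.

```python
import sqlite3

# Small in-memory stand-in for the facts table (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, area_land INTEGER, area_water INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("Albania", 27398, 1350),
        ("Virgin Islands", 346, 1564),
        ("British Indian Ocean Territory", 60, 54340),
    ],
)

# Cast to FLOAT so the ratio is not truncated by integer division.
water_ratios = conn.execute(
    "SELECT name, ROUND(CAST(area_water AS FLOAT) / area_land, 2) AS water_to_land "
    "FROM facts "
    "WHERE area_water > area_land "
    "ORDER BY water_to_land DESC"
).fetchall()
print(water_ratios)
```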