# Analyzing CIA Factbook Data Using SQL

## Introduction
In this project, we'll work with data from the [CIA World Factbook](https://www.cia.gov/library/publications/the-world-factbook/), a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information like the following:
* `population`: the country's population.
* `population_growth`: the annual population growth rate, as a percentage.
* `area`: the total land and water area, in square kilometers.

We'll use the following code to connect our Jupyter Notebook to our database file:

In [1]:
%%capture
%load_ext sql
%sql sqlite:///dataset/factbook.db

## Overview of the Data

We can query the `sqlite_master` table to list the tables in the `factbook.db` database, along with the SQL statement that defines each one. We can then preview the first few rows of the `facts` table.

In [2]:
%%sql
SELECT *
  FROM sqlite_master
 WHERE type = 'table';

 * sqlite:///dataset/factbook.db
Done.


type,name,tbl_name,rootpage,sql
table,sqlite_sequence,sqlite_sequence,3,"CREATE TABLE sqlite_sequence(name,seq)"
table,facts,facts,47,"CREATE TABLE ""facts"" (""id"" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, ""code"" varchar(255) NOT NULL, ""name"" varchar(255) NOT NULL, ""area"" integer, ""area_land"" integer, ""area_water"" integer, ""population"" integer, ""population_growth"" float, ""birth_rate"" float, ""death_rate"" float, ""migration_rate"" float)"
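
The `CREATE TABLE` statement in the `sql` column is hard to read on one line. As a rough sketch, SQLite's `PRAGMA table_info` can list the columns one per row instead. The example below uses Python's built-in `sqlite3` module and an empty in-memory table built from the statement shown above. That in-memory table is a stand-in chosen for illustration, not the real `factbook.db` file.

```python
import sqlite3

# In-memory stand-in for factbook.db, built from the CREATE TABLE statement
# shown in the sqlite_master output above (no rows are loaded).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE "facts" (
        "id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        "code" varchar(255) NOT NULL,
        "name" varchar(255) NOT NULL,
        "area" integer, "area_land" integer, "area_water" integer,
        "population" integer, "population_growth" float,
        "birth_rate" float, "death_rate" float, "migration_rate" float
    )
""")

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = conn.execute("PRAGMA table_info(facts)").fetchall()
for cid, name, col_type, notnull, default, pk in columns:
    print(f"{name:<18} {col_type:<13} notnull={notnull} pk={pk}")
```

Against the real database, the same `PRAGMA` statement could presumably be run in a `%%sql` cell as well.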


In [3]:
%%sql
SELECT *
  FROM facts
 LIMIT 5;

 * sqlite:///dataset/factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
1,af,Afghanistan,652230,652230,0,32564342,2.32,38.57,13.89,1.51
2,al,Albania,28748,27398,1350,3029278,0.3,12.92,6.58,3.3
3,ag,Algeria,2381741,2381741,0,39542166,1.84,23.67,4.31,0.92
4,an,Andorra,468,468,0,85580,0.12,8.13,6.96,0.0
5,ao,Angola,1246700,1246700,0,19625353,2.78,38.78,11.49,0.46


Here are the descriptions for some of the columns:
* `name`: the name of the country.
* `area`: the country's total area (both land and water) in square kilometers.
* `area_land`: the country's land area in [square kilometers](https://www.cia.gov/library/publications/the-world-factbook/rankorder/2147rank.html).
* `area_water`: the country's water area in square kilometers.
* `population`: the country's population.
* `population_growth`: the country's population growth as a percentage.
* `birth_rate`: the country's birth rate, or the number of births per year per 1,000 people.
* `death_rate`: the country's death rate, or the number of deaths per year per 1,000 people.

Let's start by calculating some summary statistics and looking for any outlier countries.

## Summary Statistics

In [4]:
%%sql
SELECT MIN(population) AS min_population, 
       MAX(population) AS max_population, 
       MIN(population_growth) AS min_population_growth,
       MAX(population_growth) AS max_population_growth
  FROM facts;

 * sqlite:///dataset/factbook.db
Done.


min_population,max_population,min_population_growth,max_population_growth
0,7256490011,0.0,4.02


The summary statistics above show a few interesting things:
* There's a country with a population of `0`.
* There's a country with a population of `7256490011` (more than 7.2 billion people).

## Exploring Outliers
Let's use subqueries to zoom in on just these countries without hard-coding the specific values.

In [5]:
%%sql
SELECT name
  FROM facts
 WHERE population = (SELECT MIN(population)
                       FROM facts);

 * sqlite:///dataset/factbook.db
Done.


name
Antarctica


In [6]:
%%sql
SELECT name
  FROM facts
 WHERE population = (SELECT MAX(population)
                       FROM facts);

 * sqlite:///dataset/factbook.db
Done.


name
World
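
The two lookups above can also be combined into a single query. The following is a minimal sketch using Python's built-in `sqlite3` module. The two-column in-memory table is a toy stand-in for `facts`, and it holds only names and populations copied from the outputs in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy stand-in for the facts table: only name and population,
# with values copied from the query outputs shown in this notebook.
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [
        ("Afghanistan", 32564342),
        ("Albania", 3029278),
        ("Algeria", 39542166),
        ("Andorra", 85580),
        ("Angola", 19625353),
        ("Antarctica", 0),
        ("World", 7256490011),
    ],
)

# Each scalar subquery returns one value; the outer query keeps the rows
# whose population matches either the minimum or the maximum.
extremes = conn.execute("""
    SELECT name, population
      FROM facts
     WHERE population = (SELECT MIN(population) FROM facts)
        OR population = (SELECT MAX(population) FROM facts)
     ORDER BY population
""").fetchall()
print(extremes)  # [('Antarctica', 0), ('World', 7256490011)]
```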


The table contains a row for the whole world, which explains the population of over 7.2 billion. It also contains a row for Antarctica, which explains the population of 0. This matches the CIA Factbook [page for Antarctica](https://www.cia.gov/library/publications/the-world-factbook/geos/ay.html):

![nn](img/fb_antarctica.png)

Now that we know this, we should recalculate the summary statistics from earlier, this time excluding the row for the whole world.

In [7]:
%%sql
SELECT MIN(population) AS min_population, 
       MAX(population) AS max_population, 
       MIN(population_growth) AS min_population_growth,
       MAX(population_growth) AS max_population_growth
  FROM facts
 WHERE name <> 'World';

 * sqlite:///dataset/factbook.db
Done.


min_population,max_population,min_population_growth,max_population_growth
0,1367485388,0.0,4.02
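
The `WHERE name <> 'World'` filter is repeated in every remaining query. One possible way to state it once is a view, sketched below with Python's built-in `sqlite3` module. The view name `countries` is hypothetical, and the in-memory table is a seven-row toy stand-in for `facts`, so its maximum differs from the real result above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy stand-in for the facts table, using populations shown in this notebook.
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [
        ("Afghanistan", 32564342),
        ("Albania", 3029278),
        ("Algeria", 39542166),
        ("Andorra", 85580),
        ("Angola", 19625353),
        ("Antarctica", 0),
        ("World", 7256490011),
    ],
)

# Hypothetical view: every row except the aggregate 'World' row.
conn.execute(
    "CREATE VIEW countries AS SELECT * FROM facts WHERE name <> 'World'"
)

# Queries against the view no longer need to repeat the filter.
min_pop, max_pop = conn.execute(
    "SELECT MIN(population), MAX(population) FROM countries"
).fetchone()
print(min_pop, max_pop)  # 0 39542166 (toy data only)
```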


## Exploring Average Population and Area

In [8]:
%%sql
SELECT AVG(population) AS avg_population,
       AVG(area) AS avg_area
  FROM facts
 WHERE name <> 'World';

 * sqlite:///dataset/factbook.db
Done.


avg_population,avg_area
32242666.56846473,555093.546184739
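
One detail worth keeping in mind when reading these averages: SQLite's `AVG` skips `NULL` values, so two averages from the same query can be computed over different numbers of rows. Whether the real `facts` table contains missing values is an assumption not checked here. The sketch below illustrates the behavior on a toy in-memory table, using rows shown earlier plus Antarctica with its area left as `NULL` because that value does not appear in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy stand-in for the facts table with three columns only.
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER, area INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("Afghanistan", 32564342, 652230),
        ("Albania", 3029278, 28748),
        ("Algeria", 39542166, 2381741),
        ("Andorra", 85580, 468),
        ("Angola", 19625353, 1246700),
        ("Antarctica", 0, None),  # area not shown in this notebook, so NULL
    ],
)

# COUNT(*) counts every row; COUNT(area) and AVG(area) skip NULL areas.
count_all, count_area, avg_population, avg_area = conn.execute("""
    SELECT COUNT(*), COUNT(area), AVG(population), AVG(area)
      FROM facts
""").fetchone()
print(count_all, count_area)     # 6 5
print(avg_population, avg_area)  # population averaged over 6 rows, area over 5
```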


## Finding Densely Populated Countries

To finish, we'll build on the query from the previous cell to find countries that are densely populated. We'll identify countries that have the following:
* Above-average values for population
* Below-average values for area

In [9]:
%%sql
SELECT name, population, area
  FROM facts
 WHERE name <> 'World'
   AND population > (SELECT AVG(population)
                       FROM facts
                      WHERE name <> 'World')
   AND area < (SELECT AVG(area)
                 FROM facts
                WHERE name <> 'World');

 * sqlite:///dataset/factbook.db
Done.


name,population,area
Bangladesh,168957745,148460
Germany,80854408,357022
Iraq,37056169,438317
Italy,61855120,301340
Japan,126919659,377915
"Korea, South",49115196,99720
Morocco,33322699,446550
Philippines,100998376,300000
Poland,38562189,312685
Spain,48146134,505370
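
The filter above identifies dense countries indirectly, by comparing each value to an average. A more direct measure is population divided by area. As a rough sketch, the example below ranks only the ten rows returned above by that ratio, using Python's built-in `sqlite3` module. The in-memory table and the `density` alias are illustrative choices, not part of the original database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The ten rows returned by the query above, copied as-is.
conn.execute("CREATE TABLE dense (name TEXT, population INTEGER, area INTEGER)")
conn.executemany(
    "INSERT INTO dense VALUES (?, ?, ?)",
    [
        ("Bangladesh", 168957745, 148460),
        ("Germany", 80854408, 357022),
        ("Iraq", 37056169, 438317),
        ("Italy", 61855120, 301340),
        ("Japan", 126919659, 377915),
        ("Korea, South", 49115196, 99720),
        ("Morocco", 33322699, 446550),
        ("Philippines", 100998376, 300000),
        ("Poland", 38562189, 312685),
        ("Spain", 48146134, 505370),
    ],
)

# CAST to REAL avoids integer division; density is people per square kilometer.
ranked = conn.execute("""
    SELECT name,
           ROUND(CAST(population AS REAL) / area, 1) AS density
      FROM dense
     ORDER BY density DESC
""").fetchall()
for name, density in ranked:
    print(f"{name:<14} {density}")
```

On these ten rows, Bangladesh ranks first and Morocco last. Running the same ratio over the full `facts` table would also need a filter for rows whose area is zero or missing.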
