# Intro to SQL: Databases

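- Working from a CSV file in pandas has downsides, including:
    + the file is a static snapshot
    + load time is long for large files
    + all of the data is held in memory (so a dataset larger than the available RAM, e.g. > 8 GB, cannot be loaded whole)
- A SQL database, on the other hand:
    + keeps the data on disk rather than in memory
    + is dynamic (the data can be queried and updated in place)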

## Connecting and Querying
- SELECT : choose the columns to return; reads the data without modifying it
- FROM : dataset to query
- WHERE (col, operator, value) : conditional statement
    + has to come after SELECT and FROM
- LIMIT : limit number of returned results
- OR / AND : logical operators that combine conditions in a conditional statement
- ORDER BY 'col name' [ASC or DESC] : order the rows by a column in ascending/descending order
    + a nested ORDER BY orders by the first column and then, within ties, by the 2nd column (ex. last name, first name)
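- GROUP BY : analogous to pandas groupby; collects rows that share a value so aggregates (SUM, AVG, COUNT, ...) are computed per group
- HAVING : same as WHERE except it filters after grouping, so it is used when the condition value isn't originally in the db (eg an aggregate or column arithmetic)
- ROUND(var, # decimal) : rounds var to the given number of decimal places (to the nearest int when # decimal is omitted)
- AS : renames a column in the result as desired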
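
The clauses above can be combined in a single query. The following is a minimal sketch against a throwaway in-memory database; the `people` table, its columns, and its rows are made up for illustration and are not part of `factbook.db`.

```python
import sqlite3

# Hypothetical in-memory table, used only to illustrate the clauses above
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE people (first TEXT, last TEXT, city TEXT, salary REAL)')
c.executemany('INSERT INTO people VALUES (?, ?, ?, ?)', [
    ('Ann', 'Lee', 'Oslo', 50.123),
    ('Bob', 'Lee', 'Oslo', 70.456),
    ('Cat', 'Kim', 'Rome', 40.0),
    ('Dan', 'Ray', 'Rome', 20.0),
    ('Eve', 'Ray', 'Bonn', 90.0),
])

# SELECT / FROM / WHERE with AND, a nested ORDER BY, and LIMIT
top_rows = c.execute('''
    SELECT first, last
    FROM people
    WHERE salary > 30 AND city != 'Bonn'
    ORDER BY last ASC, first DESC
    LIMIT 2;
''').fetchall()
print(top_rows)   # [('Cat', 'Kim'), ('Bob', 'Lee')]

# GROUP BY with an aggregate, HAVING to filter on that aggregate,
# ROUND to one decimal place, and AS to rename the result column
city_rows = c.execute('''
    SELECT city, ROUND(AVG(salary), 1) AS avg_salary
    FROM people
    GROUP BY city
    HAVING AVG(salary) > 35
    ORDER BY city;
''').fetchall()
print(city_rows)  # [('Bonn', 90.0), ('Oslo', 60.3)]
```

Rome is dropped by HAVING because its average salary (30) is computed per group and does not exist as a stored column, which is the case WHERE cannot handle.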

In [1]:
import sqlite3

In [2]:
# Create a connection instance
conn = sqlite3.connect('factbook.db')

**To run a query, pass the SQL statement to the database as a string.**

**This is done through the Cursor class, which can:**
- run a query against the database.
- parse the results from the database.
- convert the results to native Python objects.
- store the results within the Cursor instance as a local variable.

In [3]:
c = conn.cursor()
c.execute('select * from facts;')
c.fetchmany(2)  # fetchall() and fetchone() are also available

[(1,
  u'af',
  u'Afghanistan',
  652230,
  652230,
  0,
  32564342,
  2.32,
  38.57,
  13.89,
  1.51,
  u'2015-11-01 13:19:49.461734',
  u'2015-11-01 13:19:49.461734'),
 (2,
  u'al',
  u'Albania',
  28748,
  27398,
  1350,
  3029278,
  0.3,
  12.92,
  6.58,
  3.3,
  u'2015-11-01 13:19:54.431082',
  u'2015-11-01 13:19:54.431082')]
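
The three fetch methods all consume rows from the same result set, so they can be mixed. A small sketch of how they behave, using a made-up two-column `facts` table in memory rather than the real `factbook.db` schema:

```python
import sqlite3

# Simplified, hypothetical stand-in for the facts table
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE facts (id INTEGER, name TEXT)')
c.executemany('INSERT INTO facts VALUES (?, ?)',
              [(1, 'A'), (2, 'B'), (3, 'C'), (4, 'D')])

c.execute('SELECT * FROM facts ORDER BY id;')
first = c.fetchone()       # the next single row, as a tuple
next_two = c.fetchmany(2)  # a list with up to 2 further rows
rest = c.fetchall()        # a list with every remaining row
done = c.fetchone()        # None once the result set is exhausted
conn.close()

print(first)     # (1, 'A')
print(next_two)  # [(2, 'B'), (3, 'C')]
print(rest)      # [(4, 'D')]
print(done)      # None
```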

## Using pandas to work with a SQL database

In [4]:
import pandas as pd

df = pd.read_sql_query('select * from facts', conn)
df = df.dropna(axis=0)
df.head(2)

| | id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate | created_at | updated_at |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | af | Afghanistan | 652230.0 | 652230.0 | 0.0 | 32564342.0 | 2.32 | 38.57 | 13.89 | 1.51 | 2015-11-01 13:19:49.461734 | 2015-11-01 13:19:49.461734 |
| 1 | 2 | al | Albania | 28748.0 | 27398.0 | 1350.0 | 3029278.0 | 0.3 | 12.92 | 6.58 | 3.3 | 2015-11-01 13:19:54.431082 | 2015-11-01 13:19:54.431082 |


Write a function that takes in the initial population and the growth rate of a country, and outputs the final population.   
The annual population growth (expressed as a percentage) for each country is in the population_growth column. The initial population is in the population column.  
The formula for compound annual population growth is $N = N_0 e^{rt}$, where $N$ is the final population, $N_0$ is the initial population, $e$ is a constant value you can access with math.e, $r$ is the rate of annual change, expressed as a decimal (so 1.5 percent should be .015), and $t$ is the number of years to calculate for.  
Assume that you'll be starting in January 2015, and you'll be ending in January 2050, or 35 years.
Let's say you have a country with 5000 people, and a 4 percent annual growth rate. The formula would look like $N = 5000 \cdot e^{0.04 \cdot 35}$.
Use the apply method on Pandas Dataframes to compute the population in 2050 for each row in the data.
Use the Dataframe sort_values method to sort on the 2050 population in descending order.
Print the 10 countries that will have the highest projected populations in 2050.
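
As a quick check of the formula, the 5000-person example can be worked through in plain Python. This is only a sketch; the function name `project_population` and its `years` default are chosen here for illustration.

```python
import math

def project_population(initial, growth_percent, years=35):
    """Apply N = N0 * e^(r*t), with r given as a percentage."""
    rate = growth_percent / 100.0  # 4 percent -> 0.04
    return initial * math.exp(rate * years)

example = project_population(5000, 4)
print(round(example))  # 20276
```

With a 4 percent rate the exponent is 0.04 * 35 = 1.4, so the population grows by a factor of about 4.06 over the 35 years.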

### Estimating the population in 2050 and displaying 10 countries with the highest estimated population

population in 2050 = population \* exp(population_growth / 100 \* 35), where population_growth is a percentage and 35 is the number of years

In [33]:
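from math import exp

def popcalc(row):
    # population_growth is a percentage, so divide by 100;
    # 35 is the number of years from 2015 to 2050
    return row['population'] * exp(row['population_growth'] / 100. * 35)


# Projected 2050 population for every row
pop2050 = df.apply(popcalc, axis=1)
sorted_pop = sorted(pop2050, reverse=True)
print(sorted_pop[:10])

# Attach the projection to the dataframe and show the top 10 countries
df['2050pop'] = pop2050
print(df.sort_values('2050pop', ascending=False).head(10))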

[1918414568.4858003, 1600752082.0139306, 427989003.1189267, 422246629.08864832, 353241773.49396241, 331867609.77871519, 295789677.88354111, 267439339.32830244, 187107846.63657495, 183986320.75522664]
      id code                               name       area  area_land  \
76    77   in                              India  3287263.0  2973193.0   
36    37   ch                              China  9596960.0  9326410.0   
128  129   ni                            Nigeria   923768.0   910768.0   
185  186   us                      United States  9826675.0  9161966.0   
77    78   id                          Indonesia  1904569.0  1811569.0   
131  132   pk                           Pakistan   796095.0   770875.0   
13    14   bg                         Bangladesh   148460.0   130170.0   
23    24   br                             Brazil  8515770.0  8358140.0   
39    40   cg  Congo, Democratic Republic of the  2344858.0  2267048.0   
113  114   mx                             Mexico  1964375.0 
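
The same ranking could also be produced inside the database instead of with `apply`. This is an alternative sketch, not the approach used above: it registers Python's `math.exp` as a SQL function through `sqlite3`'s `create_function`, and it runs on a made-up three-row table rather than `factbook.db`.

```python
import math
import sqlite3

conn = sqlite3.connect(':memory:')
# Expose math.exp to SQL under the name exp (one argument)
conn.create_function('exp', 1, math.exp)

c = conn.cursor()
# Hypothetical, simplified table and values for illustration only
c.execute('CREATE TABLE facts (name TEXT, population INTEGER, population_growth REAL)')
c.executemany('INSERT INTO facts VALUES (?, ?, ?)', [
    ('A', 1000, 2.0),
    ('B', 5000, 0.0),
    ('C', 2000, 4.0),
])

rows = c.execute('''
    SELECT name, population * exp(population_growth / 100.0 * 35) AS pop2050
    FROM facts
    WHERE population IS NOT NULL AND population_growth IS NOT NULL
    ORDER BY pop2050 DESC
    LIMIT 2;
''').fetchall()

top_names = [name for name, _ in rows]
print(top_names)  # ['C', 'B']
```

Doing the sort and LIMIT in SQL means only the top rows are returned to Python, which matters more as the table grows.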

## Computing the total area of land and water with a SQL query

In [49]:
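# Sum each column over the rows that have a value; fetchone() returns a 1-tuple
area_land = c.execute('select sum(area_land) from facts where area_land!=""').fetchone()
area_water = c.execute('select sum(area_water) from facts where area_water!=""').fetchone()

print(area_land)
print(area_water)

# Ratio of total land area to total water area
print(float(area_land[0]) / area_water[0])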

(128584834,)
(4633425,)
27.7515734041
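
Both totals and their ratio can also be computed in one query. A minimal sketch, assuming missing values are stored as NULL (which SUM skips); the three-row table below is made up and stands in for `facts`.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
# Hypothetical stand-in for the facts table
c.execute('CREATE TABLE facts (name TEXT, area_land INTEGER, area_water INTEGER)')
c.executemany('INSERT INTO facts VALUES (?, ?, ?)', [
    ('A', 300, 100),
    ('B', 500, 100),
    ('C', None, 50),  # missing land area: ignored by SUM(area_land)
])

# CAST to REAL so the division is not integer division
total_land, total_water, ratio = c.execute('''
    SELECT SUM(area_land),
           SUM(area_water),
           CAST(SUM(area_land) AS REAL) / SUM(area_water) AS land_to_water
    FROM facts;
''').fetchone()

print(total_land, total_water, ratio)  # 800 250 3.2
```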
