The pandas workflow works well when:

* the data fits in memory (a few gigabytes but not terabytes)
* the data is relatively static (doesn't need to be loaded into memory every minute because the data has changed)
* only a single person is accessing the data (shared access to memory is difficult)
* security isn't important (security is critical for company scale production situations)

When the data:

* changes frequently,
* requires shared access,
* doesn't fit in memory, or
* must be kept secure,

a database is a much better solution.

A database is a data representation that lives on disk and can be **queried**, **accessed**, and **updated** without using much memory. We primarily interact with a database using a [database management system](https://en.wikipedia.org/wiki/Database), or **DBMS** for short.

In the pandas workflow, we spend most of our time deciding which functions and methods to use and juggling the variables that hold intermediate results. To work with data stored in a database, we instead use a language called **SQL** (Structured Query Language). In SQL, we express each request (whether it's **fetching** a subset of the data or **editing** values in it) as a single query, then ask the **DBMS** to run the query and display any results.

For example, to fetch every row and column of a table named `salaries`, we would:

* write the SQL query: `SELECT * FROM salaries;`
* ask the **DBMS** to run the query and display the results to us
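
As a rough sketch of these two steps, the snippet below uses Python's built-in `sqlite3` module with a throwaway in-memory database. The `salaries` table, its columns, and its rows are made up for illustration and are not part of the dataset used later.

```python
import sqlite3

# Assumption: an in-memory database with a made-up `salaries` table.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE salaries (name TEXT, salary INTEGER)")
conn_demo.executemany(
    "INSERT INTO salaries VALUES (?, ?)",
    [("Ana", 52000), ("Ben", 61000)],
)

# Step 1: write the SQL query.
query_demo = "SELECT * FROM salaries;"

# Step 2: ask the DBMS to run the query and hand back the results.
rows = conn_demo.execute(query_demo).fetchall()
print(rows)

conn_demo.close()
```

Each row comes back as a tuple with one value per column.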

Because the data lives on disk, we can work with datasets that consume multiple terabytes of disk space. Many data science teams in industry have servers and setups in **cloud environments** like **Microsoft Azure** or **Amazon Web Services (AWS)** that let team members work with this scale of data. 

Robust and popular **DBMS** tools like **Postgres** and **MySQL** include powerful features for managing **user credentials**, **security**, and **high data throughput** (quickly changing data).

We'll learn the fundamentals of SQL using a small, portable **DBMS** called [`SQLite`](https://www.sqlite.org/index.html). SQLite is one of the most widely deployed database engines in the world and is lightweight enough that the **SQLite DBMS** is included as a [module in Python](https://docs.python.org/3.6/library/sqlite3.html).

We'll explore data from the **American Community Survey** on job outcome statistics based on college majors. The original CSV version can be found on [FiveThirtyEight's GitHub](https://github.com/fivethirtyeight/data/tree/master/college-majors).

We'll be using a slightly modified version of the data that's stored as a database. We'll work with the portion that covers recent college grads only, from 2010 to 2012. We'll learn how to write SQL queries to explore and start to understand the dataset.

We've loaded the dataset on job outcome statistics into a database. A database usually consists of multiple, related tables of data. Each table contains rows and columns, just like a CSV file. We'll be working with the database file `jobs.db`, which contains a single table named `recent_grads`.
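
One way to check which tables a SQLite database contains is to query its built-in `sqlite_master` catalog. The sketch below does this against an in-memory stand-in with two made-up tables (`majors` and `categories` are hypothetical names), so it runs without any file on disk.

```python
import sqlite3

# Assumption: an in-memory stand-in for a database file, holding two
# made-up tables so there is something to list.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE majors (major_code INTEGER, major TEXT)")
demo.execute("CREATE TABLE categories (category TEXT)")

# SQLite records every table it holds in the `sqlite_master` catalog.
tables = demo.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
table_names = [name for (name,) in tables]
print(table_names)

demo.close()
```

Run against a connection to `jobs.db`, the same catalog query should list `recent_grads`.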

In [8]:
import pandas as pd
import numpy as np
import sqlite3 as sql

database = "jobs.db"
conn = sql.connect(database) # Creating connection to database file

In [9]:
query = '''SELECT * FROM recent_grads LIMIT 5'''
# c = conn.cursor() # create a Cursor object 
# c.execute(query)
# print(c.fetchone())

In [12]:
pd.options.display.max_columns = 150
pd.read_sql_query(query, conn)


Unnamed: 0,index,Rank,Major_code,Major,Major_category,Total,Sample_size,Men,Women,ShareWomen,Employed,Full_time,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,0,1,2419,PETROLEUM ENGINEERING,Engineering,2339,36,2057,282,0.120564,1976,1849,270,1207,37,0.018381,110000,95000,125000,1534,364,193
1,1,2,2416,MINING AND MINERAL ENGINEERING,Engineering,756,7,679,77,0.101852,640,556,170,388,85,0.117241,75000,55000,90000,350,257,50
2,2,3,2415,METALLURGICAL ENGINEERING,Engineering,856,3,725,131,0.153037,648,558,133,340,16,0.024096,73000,50000,105000,456,176,0
3,3,4,2417,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,1258,16,1123,135,0.107313,758,1069,150,692,40,0.050125,70000,43000,80000,529,102,0
4,4,5,2405,CHEMICAL ENGINEERING,Engineering,32260,289,21239,11021,0.341631,25694,23170,5180,16697,1672,0.061098,65000,50000,75000,18314,4440,972


Now that we have a sense of what each column represents, here are some questions we may want to answer:

* Which majors had mostly female students? Which ones had mostly male students?
* Which majors had the largest spread (difference) between the 25th and 75th percentile starting salaries?
* Which engineering majors had the highest full time employment rates?

In [15]:
# Creating a function to read query
def read_query(query):
    return pd.read_sql_query(query, conn)

In [19]:
# Find the majors where women were in the minority (ShareWomen below 0.5)
query = '''SELECT Major, ShareWomen FROM recent_grads WHERE ShareWomen < 0.5'''
read_query(query).head()

Unnamed: 0,Major,ShareWomen
0,PETROLEUM ENGINEERING,0.120564
1,MINING AND MINERAL ENGINEERING,0.101852
2,METALLURGICAL ENGINEERING,0.153037
3,NAVAL ARCHITECTURE AND MARINE ENGINEERING,0.107313
4,CHEMICAL ENGINEERING,0.341631


Most database systems require that the `SELECT` and `FROM` clauses come first, before `WHERE` or any other clauses.

To filter rows by specific criteria, we use the `WHERE` clause. A simple `WHERE` clause requires three things:

* The column we want the database to filter on: `ShareWomen`
* A comparison operator that specifies how we want to compare a value in a column: `<`
* The value we want the database to compare each value to: `0.5`

Here are the comparison operators we can use:

* Less than: `<`
* Less than or equal to: `<=`
* Greater than: `>`
* Greater than or equal to: `>=`
* Equal to: `=`
* Not equal to: `!=`
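
To see how each operator behaves, here is a small sketch that applies all six to the same column. The `scores` table and the helper `labels_where` are hypothetical and exist only for this illustration.

```python
import sqlite3

# Assumption: a made-up three-row table to compare against the value 0.5.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE scores (label TEXT, value REAL)")
demo.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 0.25), ("b", 0.5), ("c", 0.75)],
)

def labels_where(operator):
    # `operator` only ever comes from the fixed list below, never from
    # user input, so formatting it into the query text is acceptable here.
    sql = f"SELECT label FROM scores WHERE value {operator} 0.5 ORDER BY label"
    return [label for (label,) in demo.execute(sql)]

results = {op: labels_where(op) for op in ["<", "<=", ">", ">=", "=", "!="]}
for op, labels in results.items():
    print(op, labels)

demo.close()
```

Note that SQL uses a single `=` for equality, unlike Python's `==`.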

In [23]:
query  = '''SELECT Major, Major_category, Median, 
ShareWomen FROM recent_grads WHERE ShareWomen > 0.5 AND Median > 50000'''

read_query(query)

Unnamed: 0,Major,Major_category,Median,ShareWomen
0,ACTUARIAL SCIENCE,Business,62000,0.535714
1,COMPUTER SCIENCE,Computers & Mathematics,53000,0.578766


In [26]:
query = '''SELECT Major, Median, Unemployed
FROM recent_grads WHERE (Median > 10000) OR (Unemployed <= 1000) LIMIT 20'''
read_query(query)

Unnamed: 0,Major,Median,Unemployed
0,PETROLEUM ENGINEERING,110000,37
1,MINING AND MINERAL ENGINEERING,75000,85
2,METALLURGICAL ENGINEERING,73000,16
3,NAVAL ARCHITECTURE AND MARINE ENGINEERING,70000,40
4,CHEMICAL ENGINEERING,65000,1672
5,NUCLEAR ENGINEERING,65000,400
6,ACTUARIAL SCIENCE,62000,308
7,ASTRONOMY AND ASTROPHYSICS,62000,33
8,MECHANICAL ENGINEERING,60000,4650
9,ELECTRICAL ENGINEERING,60000,3895


In [29]:
query = '''SELECT Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)'''

read_query(query)

Unnamed: 0,Major,Major_category,ShareWomen,Unemployment_rate
0,PETROLEUM ENGINEERING,Engineering,0.120564,0.018381
1,METALLURGICAL ENGINEERING,Engineering,0.153037,0.024096
2,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,0.107313,0.050125
3,MATERIALS SCIENCE,Engineering,0.31082,0.023043
4,ENGINEERING MECHANICS PHYSICS AND SCIENCE,Engineering,0.183985,0.006334
5,INDUSTRIAL AND MANUFACTURING ENGINEERING,Engineering,0.343473,0.042876
6,MATERIALS ENGINEERING AND MATERIALS SCIENCE,Engineering,0.292607,0.027789
7,ENVIRONMENTAL ENGINEERING,Engineering,0.558548,0.093589
8,INDUSTRIAL PRODUCTION TECHNOLOGIES,Engineering,0.750473,0.028308
9,ENGINEERING AND INDUSTRIAL MANAGEMENT,Engineering,0.174123,0.033652
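
The parentheses in the query above matter because `AND` binds more tightly than `OR` in SQL. The sketch below shows the difference on a made-up three-row table; the table, its column names, and its values are assumptions chosen for illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE majors (major TEXT, category TEXT, share_women REAL, unemp REAL)"
)
demo.executemany(
    "INSERT INTO majors VALUES (?, ?, ?, ?)",
    [
        ("A", "Engineering", 0.2, 0.02),  # engineering, low unemployment
        ("B", "Engineering", 0.2, 0.09),  # engineering, fails both inner tests
        ("C", "Arts", 0.3, 0.02),         # not engineering, low unemployment
    ],
)

# With parentheses: the category test applies to every returned row.
grouped_sql = """SELECT major FROM majors
WHERE category = 'Engineering' AND (share_women > 0.5 OR unemp < 0.051)
ORDER BY major"""

# Without parentheses this is read as:
# (category = 'Engineering' AND share_women > 0.5) OR unemp < 0.051
ungrouped_sql = """SELECT major FROM majors
WHERE category = 'Engineering' AND share_women > 0.5 OR unemp < 0.051
ORDER BY major"""

grouped = [m for (m,) in demo.execute(grouped_sql)]
ungrouped = [m for (m,) in demo.execute(ungrouped_sql)]
print(grouped, ungrouped)

demo.close()
```

Without the parentheses, the non-engineering row `C` slips through because it satisfies the `OR` branch on its own.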


The results of every query we've written so far have come back ordered by the `Rank` column, which is also the order of the rows in the table. If we modify the above query to include the `Rank` column, we'll see that the results are ordered by `Rank` as well. Without an explicit ordering, though, SQL makes no guarantee about the order of the rows.

We can specify the order using the `ORDER BY` clause. By default, it returns the results in **ascending order** (increasing) by the given column.
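
A minimal sketch of the default ascending order, using a made-up table whose rows are deliberately inserted out of order (the table and column names are assumptions for illustration):

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE majors (major TEXT, share_women REAL)")
demo.executemany(
    "INSERT INTO majors VALUES (?, ?)",
    [("B", 0.6), ("A", 0.2), ("C", 0.4)],
)

# ORDER BY sorts ascending unless told otherwise.
default_order = [
    m for (m,) in demo.execute("SELECT major FROM majors ORDER BY share_women")
]

# The ASC keyword states the same thing explicitly.
explicit_asc = [
    m for (m,) in demo.execute("SELECT major FROM majors ORDER BY share_women ASC")
]
print(default_order, explicit_asc)

demo.close()
```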

If we instead want the results ordered by the same column but in **descending order**, we can add the `DESC` keyword:

In [34]:
query = '''SELECT Major, ShareWomen, Unemployment_rate
FROM recent_grads WHERE ShareWomen > 0.3 AND Unemployment_rate < 0.1
ORDER BY ShareWomen DESC LIMIT 10'''
read_query(query)

Unnamed: 0,Major,ShareWomen,Unemployment_rate
0,EARLY CHILDHOOD EDUCATION,0.967998,0.040105
1,MATHEMATICS AND COMPUTER SCIENCE,0.927807,0.0
2,ELEMENTARY EDUCATION,0.923745,0.046586
3,ANIMAL SCIENCES,0.910933,0.050862
4,PHYSIOLOGY,0.906677,0.069163
5,MISCELLANEOUS PSYCHOLOGY,0.90559,0.051908
6,HUMAN SERVICES AND COMMUNITY ORGANIZATION,0.904075,0.037819
7,NURSING,0.896019,0.044863
8,GEOSCIENCES,0.881294,0.024374
9,MASS MEDIA,0.877228,0.089837


In [36]:
# Which Engineering and Physical Sciences majors had the lowest unemployment rates?
# Each line below is a separate clause; the whole query is a statement.
# AND and OR are operators; ASC and DESC are keywords.
q = '''SELECT Major_category, Major, Unemployment_rate
    FROM recent_grads
    WHERE Major_category = 'Engineering' OR Major_category = 'Physical Sciences'
    ORDER BY Unemployment_rate ASC'''
read_query(q)

Unnamed: 0,Major_category,Major,Unemployment_rate
0,Engineering,ENGINEERING MECHANICS PHYSICS AND SCIENCE,0.006334
1,Engineering,PETROLEUM ENGINEERING,0.018381
2,Physical Sciences,ASTRONOMY AND ASTROPHYSICS,0.021167
3,Physical Sciences,ATMOSPHERIC SCIENCES AND METEOROLOGY,0.022229
4,Engineering,MATERIALS SCIENCE,0.023043
5,Engineering,METALLURGICAL ENGINEERING,0.024096
6,Physical Sciences,GEOSCIENCES,0.024374
7,Engineering,MATERIALS ENGINEERING AND MATERIALS SCIENCE,0.027789
8,Engineering,INDUSTRIAL PRODUCTION TECHNOLOGIES,0.028308
9,Engineering,ENGINEERING AND INDUSTRIAL MANAGEMENT,0.033652
